Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.
Mixed citation behavior. Most common role is background (50%).
claims ledger
- background comment-reply dataset for (dis)agreement detection in online debates. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). Miklos Z Rácz and Daniel E Rigobon. 2023. Towards consensus: Reducing polarization by perturbing social networks. IEEE Transactions on Network Science and Engineering, 10(6):3450-3464. ZP Rosen and Rick Dale. 2025. Antisemitic and islamophobic hate speech precedes a decrease in lexico-semantic diversity in comment
- background 2005. Hahacronym: A computational humor system. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pages 113-116. David Tomás, Reynier Ortega-Bueno, Guobiao Zhang, Paolo Rosso, and Rossano Schifanella. 2023. Transformer-based models for multimodal irony detection. Journal of Ambient Intelligence and Humanized Computing, 14(6):7399-7410. Robert West and Eric Horvitz. 2019. Reverse-engineering satire, or "paper on computational humor accepted despite making serious
- background We define $N$ scales with two adapter sets: $G = \{G_1, \dots, G_N\}$ (MGFA) and $C = \{C_1, \dots, C_N\}$ (MCFA). At each scale $n$, features are reshaped to a grid $X_v^{(0)} \in \mathbb{R}^{H \times W \times D_v}$ and downsampled by $\mathrm{Down}(\cdot, 2^{n-1})$: $X_v^{(n)} = \mathrm{Down}(X_v^{(0)}, 2^{n-1})$ (4). Let $X_{v,n} = \mathrm{Seq}(X_v^{(n)})$ denote the flattened sequence. We then refine and fuse: $G_n = G_n(X_{v,n})$, $C_n = C_n(X_{v,n}, X_t)$ (5), $\tilde{X}_{v,n} = G_n + w\,C_n$ (6), where $w$ balances global and cross-modal adaptation. An interleave-repeat upsampling restores the
- background Householder mean-direction alignment. The nuisance mean-direction difference is removed by mapping the sample mean direction of $X$ onto that of $Y$ via Householder reflection. Let $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\bar{y} = \frac{1}{m}\sum_{j=1}^{m} y_j$, $\hat{\mu}_x = \bar{x}/\|\bar{x}\|_2$, $\hat{\mu}_y = \bar{y}/\|\bar{y}\|_2$. If $\hat{\mu}_x \neq \hat{\mu}_y$, the Householder axis is defined as $u = (\hat{\mu}_x - \hat{\mu}_y)/\|\hat{\mu}_x - \hat{\mu}_y\|_2$ (5) and the reflection matrix is $H = I - 2uu^{\top}$ (6), which satisfies $H\hat{\mu}_x = \hat{\mu}_y$ and $H^{\top}H = I$. We then align $X$ by applying $H$ to every vector in $X$: $x'_i = Hx_i$ $(i = 1, \dots, n)$ (7), and $Y$ is
- other $t \to 1$ as the query requires more changes, thus $(1 - t) \to 1$ as the query increases in accuracy. 3.6 Query Mutation. Given the mutation temperature $t$ and assessment $A$ from the critic, the original candidate $QC$ is then rewritten via $\mathrm{LLM}_{\text{mutate}}$, which is prompted to produce an updated query candidate $QC'$ that incorporates the changes recommended by the critic: $QC' = \mathrm{LLM}_{\text{mutate}}(Q, S'_i, QC, H, A, t)$ (6). We consider a single refinement step to consist of a call to the critic, followed by a subsequent call to t
- background contribution of $Q$ and $P$ without the CoT rationale. Correspondingly, $a^{l}_{\text{no-CoT}}$ represents the attention activation excluding CoT. The additional term $W_V R (W_K R)^{\top} q$ represents the contribution of the CoT rationale $R$ to the hidden activation. We can get the hidden activation by transforming the attention activation by a nonlinear function $f$: $h^{l} \approx h^{l}_{\text{no-CoT}} + f\left(W_V R (W_K R)^{\top} q\right)$ (7). Thus, we conclude that the rationale $R$ in the CoT primarily contributes a shift in hidden activation values, emphasi
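The Householder mean-direction alignment quoted in the excerpts above can be sketched in a few lines of NumPy. This is a minimal illustration of the quoted formulas only, not the citing paper's implementation: the function name `householder_align`, the row-per-vector layout, and the tolerance `eps` are assumptions introduced here.

```python
import numpy as np


def householder_align(X, Y, eps=1e-12):
    """Reflect the rows of X so their sample mean direction matches that of Y.

    X has shape (n, d), Y has shape (m, d); each row is one vector.
    Returns the aligned copy of X and the reflection matrix H.
    The name, layout, and eps are illustrative assumptions.
    """
    # Unit-norm sample mean directions of the two clouds.
    mu_x = X.mean(axis=0)
    mu_y = Y.mean(axis=0)
    mu_x = mu_x / np.linalg.norm(mu_x)
    mu_y = mu_y / np.linalg.norm(mu_y)

    diff = mu_x - mu_y
    norm = np.linalg.norm(diff)
    d = X.shape[1]
    if norm < eps:
        # Directions already coincide, so no reflection is needed.
        return X.copy(), np.eye(d)

    # Householder axis u and reflection H = I - 2 u u^T.
    u = diff / norm
    H = np.eye(d) - 2.0 * np.outer(u, u)

    # x'_i = H x_i for every row; H is symmetric, so X @ H applies it row-wise.
    return X @ H, H
```

Because H is orthogonal, the reflection preserves every vector's norm and all pairwise distances within X, so only the mean direction changes.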
representative citing papers
Agentic CLEAR automates multi-level evaluation of LLM agents, generating textual insights at system, trace, and node granularity that align with human annotations and predict task success.
Introduces Causal Functional Signatures grounded in causal evidence and ILP-learned architectural signatures to enable explicit, comparable, and portable mechanistic claims across model scales.
New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.
LongBEL improves biomedical entity linking consistency by combining full-document context with memory of previous predictions trained via cross-validation rather than gold labels.
A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.
Semantic Softmax aggregates probabilities from semantic synonyms around target labels to correct renormalization bias in zero-shot LLM classification, lowering calibration error and raising AUROC and F1.
CA-SQL achieves 51.72% execution accuracy on the challenging tier of the BIRD benchmark using GPT-4o-mini by scaling exploration breadth according to estimated task difficulty, evolutionary prompt seeding, and candidate voting.
A new permutation test uses Householder reflection to align word embedding clouds before testing dispersion differences, cutting Type-I error by 32.5% and speeding up 23x on GPU.
LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.
S²R² improves robustness of LoRA-tuned LLMs to prompt perturbations by penalizing semantic-segment drift while preserving clean performance and cross-dataset transfer.
Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.
OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve performance.
SpanDec achieves competitive NER accuracy with improved efficiency by using a final-stage lightweight decoder for span representations and early candidate filtering to reduce redundant computation.
Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.
MALMAS is a memory-augmented multi-agent LLM system that generates diverse, high-quality features for tabular data via agent decomposition, routing, and iterative memory-guided refinement.
Quantile tokens inserted into LLM inputs combined with neighbor retrieval enable direct prediction of full distributions, yielding lower MAPE and narrower intervals than baselines on Airbnb and StackSample tasks.
Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharyya distance.
Translation function vectors extracted from English to one target language improve correct token ranking for translations to multiple other unseen target languages in decoder-only multilingual LLMs.
SG-RAG frames retrieval as subgraph matching to ensure LLMs meet every condition in factual queries and reports large gains over baselines on a new 120k-pair ERQA dataset.
MAGEO is a multi-agent system that distills validated editing patterns into reusable optimization skills for generative engines, outperforming heuristic baselines on visibility and fidelity via a new benchmark and evaluation protocol.
Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the matching cell.
LQM introduces a six-level linguistically motivated error taxonomy for MT evaluation and applies it via expert annotation to LLM outputs on a new 3,850-sentence multi-dialect Arabic corpus.
citing papers explorer
- Token-weighted Direct Preference Optimization with Attention
- PRISM: Preference-Aware Influence Function Based Data Selection Method for Efficient Fine-Tuning
- Divide-Prompt-Refine: a Training-Free, Structure-Aware Framework for Biomedical Abstract Generation
- Language-Switching Triggers Take a Latent Detour Through Language Models
- An Efficient Streaming Video Understanding Framework with Agentic Control
- Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving
- BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps
- KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models
- Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning