hub
BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage
12 Pith papers cite this work. Polarity classification is still indexing.
roles: background 3
representative citing papers
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
PMVisAgent uses multi-turn progressive interactions and a validation agent with ReAct-style verification to achieve up to 23.21% higher execution accuracy on the new PMVisBench dataset for text-to-vis tasks.
Optimus mitigates toxicity during LLM fine-tuning by combining repurposed LLM safety alignments for detection with synthetic data and DPO alignment, remaining effective even with highly biased classifiers and against attacks.
Chain-of-Verification reduces hallucinations in large language models by drafting responses, planning independent verification questions, answering them separately, and generating a final verified output.
Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
DebiasRAG uses a three-stage RAG process to generate and rerank query-specific debiasing contexts that act as fairness constraints for LLM outputs.
A user study with 20 participants shows that closeness between sketches, annotations, and language in a shared space helps disambiguate multimodal queries, leading to the concept of proximity semantics for data exploration systems.
A survey classifying RAG foundations for AIGC, summarizing enhancements, cross-modal applications, benchmarks, limitations, and future directions.
StreamMUSE performs frame-synchronous streaming inference for language models by having a client send high-frequency requests and a server return outputs aligned to an external clock, shown on live music accompaniment with open-source code.
WhiteTesseract integrates XR diminished reality and LLM dialogue to increase viewing duration and interaction depth in physical cultural heritage exhibitions, shown in a 26-participant Monet exhibition study with statistically significant results.
citing papers explorer
- Evaluating Very Long-Term Conversational Memory of LLM Agents
  Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.
- GAIA: a benchmark for General AI Assistants
  GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
- Towards Reliable Agentic Progressive Text-to-Visualization with Verification Rules
  PMVisAgent uses multi-turn progressive interactions and a validation agent with ReAct-style verification to achieve up to 23.21% higher execution accuracy on the new PMVisBench dataset for text-to-vis tasks.
- Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI
  Optimus mitigates toxicity during LLM fine-tuning by combining repurposed LLM safety alignments for detection with synthetic data and DPO alignment, remaining effective even with highly biased classifiers and against attacks.
- Chain-of-Verification Reduces Hallucination in Large Language Models
  Chain-of-Verification reduces hallucinations in large language models by drafting responses, planning independent verification questions, answering them separately, and generating a final verified output.
- DebiasRAG: A Tuning-Free Path to Fair Generation in Large Language Models through Retrieval-Augmented Generation
  DebiasRAG uses a three-stage RAG process to generate and rerank query-specific debiasing contexts that act as fairness constraints for LLM outputs.
- From 'Here' to 'There': Exploring Proximity Semantics in Multimodal Data Exploration
  A user study with 20 participants shows that closeness between sketches, annotations, and language in a shared space helps disambiguate multimodal queries, leading to the concept of proximity semantics for data exploration systems.
- Real-Time Language Model Jamming: A Case Study for Live Music Accompaniment Generation
  StreamMUSE performs frame-synchronous streaming inference for language models by having a client send high-frequency requests and a server return outputs aligned to an external clock, shown on live music accompaniment with open-source code.
- WhiteTesseract: Reframing the Interpretation of Cultural Heritage through XR and Conversational AI
  WhiteTesseract integrates XR diminished reality and LLM dialogue to increase viewing duration and interaction depth in physical cultural heritage exhibitions, shown in a 26-participant Monet exhibition study with statistically significant results.