Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.
hub
Blenderbot 3: a deployed conversational agent that continually learns to responsibly engage
10 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 3representative citing papers
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
Optimus mitigates toxicity during LLM fine-tuning by combining repurposed LLM safety alignments for detection with synthetic data and DPO alignment, remaining effective even with highly biased classifiers and against attacks.
Chain-of-Verification reduces hallucinations in large language models by drafting responses, planning independent verification questions, answering them separately, and generating a final verified output.
Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
WhiteTesseract deploys XR-based diminished reality and LLM dialogue in a Monet exhibition, raising average viewing time from 35.3 to 98.3 seconds and shifting 60% of 529 interactions toward analytical and emotional queries.
DebiasRAG uses a three-stage RAG process to generate and rerank query-specific debiasing contexts that act as fairness constraints for LLM outputs.
A user study with 20 participants shows that closeness between sketches, annotations, and language in a shared space helps disambiguate multimodal queries, leading to the concept of proximity semantics for data exploration systems.
A survey classifying RAG foundations for AIGC, summarizing enhancements, cross-modal applications, benchmarks, limitations, and future directions.
citing papers explorer
-
Evaluating Very Long-Term Conversational Memory of LLM Agents
Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.
-
GAIA: a benchmark for General AI Assistants
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
-
Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI
Optimus mitigates toxicity during LLM fine-tuning by combining repurposed LLM safety alignments for detection with synthetic data and DPO alignment, remaining effective even with highly biased classifiers and against attacks.
-
Chain-of-Verification Reduces Hallucination in Large Language Models
Chain-of-Verification reduces hallucinations in large language models by drafting responses, planning independent verification questions, answering them separately, and generating a final verified output.
-
Gorilla: Large Language Model Connected with Massive APIs
Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
-
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
-
WhiteTesseract: Reframing the Interpretation of Cultural Heritage through XR and Conversational AI
WhiteTesseract deploys XR-based diminished reality and LLM dialogue in a Monet exhibition, raising average viewing time from 35.3 to 98.3 seconds and shifting 60% of 529 interactions toward analytical and emotional queries.
-
DebiasRAG: A Tuning-Free Path to Fair Generation in Large Language Models through Retrieval-Augmented Generation
DebiasRAG uses a three-stage RAG process to generate and rerank query-specific debiasing contexts that act as fairness constraints for LLM outputs.
-
From 'Here' to 'There': Exploring Proximity Semantics in Multimodal Data Exploration
A user study with 20 participants shows that closeness between sketches, annotations, and language in a shared space helps disambiguate multimodal queries, leading to the concept of proximity semantics for data exploration systems.
-
Retrieval-Augmented Generation for AI-Generated Content: A Survey
A survey classifying RAG foundations for AIGC, summarizing enhancements, cross-modal applications, benchmarks, limitations, and future directions.