AI models lose over 40% accuracy following multiple constraints in long multi-turn conversations and over 11% even with a single constraint as length increases, per the new SEQUOR benchmark.
2 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CL (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
representative citing papers
Realsim shows simulated users fail to reproduce communication frictions present in real multi-turn chatbot dialogues, yielding overly optimistic evaluations with domain-dependent variability.
citing papers explorer
- SEQUOR: A Multi-Turn Benchmark for Realistic Constraint Following
  AI models lose over 40% accuracy following multiple constraints in long multi-turn conversations and over 11% even with a single constraint as length increases, per the new SEQUOR benchmark.
- Synthetic Users, Real Differences: an Evaluation Framework for User Simulation in Multi-Turn Conversations
  Realsim shows simulated users fail to reproduce communication frictions present in real multi-turn chatbot dialogues, yielding overly optimistic evaluations with domain-dependent variability.