A Survey of Available Corpora for Building Data-Driven Dialogue Systems

Iulian Vlad Serban; Joelle Pineau; Laurent Charlin; Peter Henderson; Ryan Lowe

arXiv: 1512.05742 · v3 · pith:VK5WG4WGnew · submitted 2015-12-17 · 💻 cs.CL · cs.AI · cs.HC · cs.LG · stat.ML

classification 💻 cs.CL · cs.AI · cs.HC · cs.LG · stat.ML

keywords data-driven · dialogue · systems · datasets · learning · area · available · discuss

During the past decade, several areas of speech and language understanding have witnessed substantial breakthroughs from the use of data-driven models. In the area of dialogue systems, the trend is less obvious, and most practical systems are still built through significant engineering and expert knowledge. Nevertheless, several recent results suggest that data-driven approaches are feasible and quite promising. To facilitate research in this area, we have carried out a wide survey of publicly available datasets suitable for data-driven learning of dialogue systems. We discuss important characteristics of these datasets, how they can be used to learn diverse dialogue strategies, and their other potential uses. We also examine methods for transfer learning between datasets and the use of external knowledge. Finally, we discuss appropriate choice of evaluation metrics for the learning objective.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Convex Low-resource Accent-Robust Language Detection in Speech Recognition
cs.LG · 2026-05 · unverdicted · novelty 5.0

CLD integrates convex optimization and ADMM in JAX to deliver 97-98% accuracy for language detection robust to accents under low-resource conditions, with claimed theoretical stability guarantees.