
arxiv: 1907.03020 · v1 · pith:YM76K64Ynew · submitted 2019-07-05 · 💻 cs.CL · cs.AI

Towards Universal Dialogue Act Tagging for Task-Oriented Dialogues

Pith reviewed 2026-05-25 02:00 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords dialogue acts · task-oriented dialogues · universal schema · annotation alignment · semi-supervised tagging · human-human conversations · dialogue act tagging

The pith

A universal dialogue act schema aligns existing datasets so a single tagger can label human-human task-oriented conversations at 54% F1 without new annotations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to cut the expense of labeling large volumes of human-human dialogue data by defining one shared schema for dialogue acts and mapping several already-annotated task-oriented datasets onto it. Manual mappings plus automated alignment let the authors pool the data and train a Universal DA tagger. When this tagger is applied to unlabeled human-human conversations it reaches 54.1% F1 on system turns; adding a modest amount of target-domain data raises the score to 57.7%, a level that would otherwise demand at least 1.7K fresh manual labels. The same approach yields further gains when unlabeled or labeled data from a new domain becomes available.

Core claim

The authors define a Universal DA schema for task-oriented dialogues, align multiple existing annotated datasets to it through a combination of manual and automated methods, and train a Universal DA tagger (U-DAT) on the resulting pooled data. Applied to human-human dialogues, the tagger obtains 54.1% F1 on system turns in a fully unsupervised setting and 57.7% F1 in a semi-supervised setting that would otherwise require at least 1.7K manually annotated turns; performance improves further when unlabeled or labeled target-domain data is supplied.

What carries the argument

The Universal DA schema together with the manual-plus-automated alignment procedure that converts labels from prior datasets into the common schema.
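The manual alignment rules quoted further down this page (Mod1 to Mod4) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the act-string format `name(args)`, the helper names `to_universal` and `align_turn`, and the choice of which act name survives each merge are all assumptions made for the example.

```python
import re

# Hypothetical merge table. The merged pairs follow Mod1, Mod2 and Mod4 as
# quoted on this page; which name survives each merge is an assumption.
MERGE_MAP = {
    "select": "offer",          # Mod1: offer/select are merged
    "user-request": "request",  # Mod2: user-request/sys-request are merged
    "sys-request": "request",
    "reqmore": "reqalts",       # Mod4: reqalts/reqmore are merged
}

# Assumed surface form of an act: a name, optionally followed by "(args)".
ACT_PATTERN = re.compile(r"^(?P<name>[\w-]+)(?:\((?P<args>.*)\))?$")


def to_universal(act: str) -> list[str]:
    """Map one dataset-specific act string to one or more universal acts."""
    match = ACT_PATTERN.match(act.strip())
    if match is None:
        raise ValueError(f"unrecognised act string: {act!r}")
    name = MERGE_MAP.get(match["name"], match["name"])
    args = match["args"]
    # Mod3: affirm(x=y) is split into affirm + inform(x=y).
    if name == "affirm" and args:
        return ["affirm", f"inform({args})"]
    return [f"{name}({args})" if args else name]


def align_turn(acts: list[str]) -> list[str]:
    """Align every act of a turn, dropping duplicates created by merges."""
    aligned: list[str] = []
    for act in acts:
        for universal in to_universal(act):
            if universal not in aligned:
                aligned.append(universal)
    return aligned


print(to_universal("affirm(time=7pm)"))
print(align_turn(["reqmore", "reqalts"]))
```

A lookup table plus one special-case split keeps the mapping auditable, which matters because the quality of this step is exactly what the referee report below questions.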

If this is right

  • Labeled task-oriented datasets can be reused across projects instead of being discarded when schemas differ.
  • New domains or customer-care logs become usable for training with far fewer than 1.7K new annotations.
  • Performance on target human-human data rises when even modest amounts of unlabeled or labeled target data are added.
  • The same aligned resource supports both human-machine and human-human tagging tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The alignment technique could be applied to other dialogue annotations such as slot values or user intents.
  • If the universal schema proves stable, the tagger might serve as a starting point for open-domain dialogue labeling.
  • Real-world customer logs could be run through the tagger to measure how often its output matches downstream system actions.

Load-bearing premise

Existing dialogue-act schemas can be mapped onto the proposed universal schema without introducing enough label noise or systematic bias to undermine training and evaluation on human-human data.

What would settle it

Collect a fresh sample of human-human turns, have experts label them directly with the universal schema, and check whether the tagger's F1 on those turns falls substantially below the reported 54.1% unsupervised figure.
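Such a check needs an F1 computation over turns that may carry several dialogue acts at once. The sketch below uses micro-averaged F1 over per-turn label sets. The averaging scheme is an assumption, since this page does not say how the paper aggregates its F1 scores, and the function name `micro_f1` and the toy labels are illustrative only.

```python
def micro_f1(gold: list[set[str]], pred: list[set[str]]) -> float:
    """Micro-averaged F1 over per-turn sets of dialogue acts.

    Assumption: counts are pooled over all turns and all acts before
    computing precision and recall. The paper may average differently.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same turns")
    true_pos = sum(len(g & p) for g, p in zip(gold, pred))
    false_pos = sum(len(p - g) for g, p in zip(gold, pred))
    false_neg = sum(len(g - p) for g, p in zip(gold, pred))
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Toy example: expert labels versus tagger output for three system turns.
gold_turns = [{"inform"}, {"request", "inform"}, {"bye"}]
pred_turns = [{"inform"}, {"request"}, {"ack"}]
print(round(micro_f1(gold_turns, pred_turns), 4))  # 0.5714
```

In the toy example there are 2 true positives, 1 false positive and 2 false negatives, so precision is 2/3, recall is 1/2, and F1 is 4/7.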

Original abstract

Machine learning approaches for building task-oriented dialogue systems require large conversational datasets with labels to train on. We are interested in building task-oriented dialogue systems from human-human conversations, which may be available in ample amounts in existing customer care center logs or can be collected from crowd workers. Annotating these datasets can be prohibitively expensive. Recently multiple annotated task-oriented human-machine dialogue datasets have been released, however their annotation schema varies across different collections, even for well-defined categories such as dialogue acts (DAs). We propose a Universal DA schema for task-oriented dialogues and align existing annotated datasets with our schema. Our aim is to train a Universal DA tagger (U-DAT) for task-oriented dialogues and use it for tagging human-human conversations. We investigate multiple datasets, propose manual and automated approaches for aligning the different schema, and present results on a target corpus of human-human dialogues. In unsupervised learning experiments we achieve an F1 score of 54.1% on system turns in human-human dialogues. In a semi-supervised setup, the F1 score increases to 57.7% which would otherwise require at least 1.7K manually annotated turns. For new domains, we show further improvements when unlabeled or labeled target domain data is available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes a Universal Dialogue Act (DA) schema for task-oriented dialogues, aligns multiple existing annotated human-machine dialogue datasets to this schema via manual and automated methods, trains a Universal DA Tagger (U-DAT), and evaluates it on a target corpus of human-human dialogues. It reports an unsupervised F1 of 54.1% on system turns that rises to 57.7% in a semi-supervised setting, claiming this performance would otherwise require at least 1.7K manual annotations.

Significance. If the alignment step produces labels free of substantial noise, the work would demonstrate a practical route to bootstrap DA taggers for human-human data from cheaper existing human-machine corpora, directly addressing the annotation bottleneck for task-oriented dialogue systems.

major comments (1)
  1. [Alignment methods section] The section describing the manual and automated schema alignment provides no validation metrics (inter-annotator agreement, held-out accuracy against gold universal labels, or error analysis of collapsed acts). Because the headline unsupervised (54.1%) and semi-supervised (57.7%) F1 scores on the human-human target corpus are obtained by training on the aligned labels, the absence of any quality check on the alignment is load-bearing for the central claim.
minor comments (1)
  1. [Abstract and Experiments] The abstract states headline F1 numbers without reference to model architecture, validation splits, or confidence intervals; these details should appear in the main experimental section for reproducibility.
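One of the validation metrics named in the major comment, inter-annotator agreement, is commonly reported as Cohen's kappa. The sketch below computes it for two annotators who each assign one universal act to the same sample of source acts. The function name `cohen_kappa` and the toy labels are illustrative assumptions, and the paper does not report this number.

```python
from collections import Counter


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy example: two annotators align four source acts to universal acts.
annotator_1 = ["offer", "offer", "request", "inform"]
annotator_2 = ["offer", "request", "request", "inform"]
print(round(cohen_kappa(annotator_1, annotator_2), 4))  # 0.6364
```

Here the observed agreement is 3/4 and the chance agreement is 5/16, giving kappa = 7/11.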

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below.

Point-by-point responses
  1. Referee: [Alignment methods section] The section describing the manual and automated schema alignment provides no validation metrics (inter-annotator agreement, held-out accuracy against gold universal labels, or error analysis of collapsed acts). Because the headline unsupervised (54.1%) and semi-supervised (57.7%) F1 scores on the human-human target corpus are obtained by training on the aligned labels, the absence of any quality check on the alignment is load-bearing for the central claim.

    Authors: We agree that explicit validation metrics for the alignment are necessary to support the central claim. In the revised manuscript we will report inter-annotator agreement on a sampled subset of the manual alignments, held-out accuracy of the automated alignment against a small set of gold universal labels, and a qualitative error analysis of the collapsed acts. These additions will be placed in the alignment section and will be independent of the downstream tagging results.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: performance metrics are independent empirical outcomes

Full rationale

The paper defines a universal DA schema, describes manual/automated alignment of prior datasets to it, trains U-DAT on the aligned data, and reports F1 scores (54.1% unsupervised, 57.7% semi-supervised) on a separate target human-human corpus. None of these steps reduce by construction to the reported numbers; the F1 values are downstream results of training and evaluation rather than re-statements of fitted inputs, self-citations, or renamed patterns. The alignment step is a preprocessing choice whose correctness is an external assumption, not a definitional loop inside the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the premise that dialogue act labels from heterogeneous datasets can be mapped to a single schema without critical loss of meaning; this premise is introduced in the abstract but not evidenced within the provided text.

axioms (1)
  • domain assumption: Existing dialogue act annotation schemas from multiple datasets can be aligned to a single universal schema without substantial information loss or bias.
    The paper states it aligns datasets with the proposed schema; this alignment step is required for the combined training data to be usable.
invented entities (1)
  • Universal DA schema (no independent evidence)
    purpose: To provide a common label set that multiple prior datasets can be mapped onto.
    The schema is introduced by the authors as the foundation for alignment and tagging.

pith-pipeline@v0.9.0 · 5753 in / 1442 out tokens · 33451 ms · 2026-05-25T02:00:43.831165+00:00 · methodology


Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 3 internal anchors

  1. [1]

    DAs have been investigated by dialogue researchers for many years [2] and multiple taxonomies have been proposed [3, 4, 5] (see [6] for a review)

    Introduction. Dialogue acts (DAs) aim to portray the meaning of utterances at the level of illocutionary force, capturing a speaker’s intention in producing that utterance [1]. DAs have been investigated by dialogue researchers for many years [2] and multiple taxonomies have been proposed [3, 4, 5] (see [6] for a review). Recent work in task-oriented...

  2. [2]

    Towards Universal Dialogue Act Tagging for Task-Oriented Dialogues

    Related Work. The mismatch between multiple DA taxonomies has been identified by [6] previously, where a subset of ISO 24617-2 (the international ISO standard for DA annotation) tags [17] have been identified and annotations of multiple corpora were mapped to this set, focusing on social conversations. Our work has a similar goal, but focuses on DAs th...

  3. [3]

    $D = u_1, u_2, \ldots, u_N$ and $A$ be the predefined set of $M$ DAs, i.e.

    DA Tagging for Dialogue Systems. Let a dialogue $D$ with $N$ turns be denoted as a series of user and system utterances, $u_i$, i.e. $D = u_1, u_2, \ldots, u_N$, and $A$ be the predefined set of $M$ DAs, i.e. $A = a_1, a_2, \ldots, a_M$. Given an utterance $u_i$ and its conversation history, DA tagging aims to predict the set of DAs $A_i \subset A$ of $u_i$. We use a deep neural network based model fo...

  4. [4]

    A bi-directional LSTM to encode each utterance, $u_i = w^i_1, \ldots, w^i_{n_i}$, where $n_i$ denotes the number of tokens in $u_i$ and the final utterance representation $z_i$ is obtained by concatenating the last hidden layer of the forward LSTM, $\overrightarrow{\mathrm{LSTM}}$, and the first hidden layer of the backward LSTM, $\overleftarrow{\mathrm{LSTM}}$: $z_i = \overleftarrow{\mathrm{LSTM}}(u_i) \oplus$ ...

  5. [5]

    A hierarchical, uni-directional LSTM to encode the dialogue-level information, $e_i$: $e_i = \mathrm{LSTM}(z_1, \ldots, z_{i-1})$

  6. [6]

    An indicator number, $g_i$, representing whether the agent is the user or the system, i.e., $g_i = 0$ if $u_i^{\mathrm{agent}} = \mathrm{user}$, $g_i = 1$ otherwise

  7. [7]

    A DA vector is represented as a many-hot vector $d_i$ of dimension $M$, where we mark the true DAs as 1

    Encoding over past DA(s) $p_i$, where the final representation is obtained by concatenating the many-hot representations of past DAs. A DA vector is represented as a many-hot vector $d_i$ of dimension $M$, where we mark the true DAs as 1: $p_i = d_1 \oplus d_2 \oplus \cdots \oplus d_{i-1}$. The final encoded context $C_i$ is given by: $C_i = e_i \oplus g_i \oplus p_i$ (2). $C_i$ is then fed into a feed forward ...

  8. [8]

    Therefore, we need a unified representation of all the acts present across the datasets

    Datasets and Experiments. Our aim is to train a Universal DA tagger using public datasets, but the label spaces across these datasets are not aligned. Therefore, we need a unified representation of all the acts present across the datasets. We obtain this representation by manually going through the datasets and aligning semantically similar sentences ...

  9. [9]

    for DAs. The GSim data has two parts and was collected by generating dialogue flows for movie (GSim-M) and restaurant (GSim-R) booking domains, where the individual turns from simulation in terms of DAs and associated arguments were then converted to natural language by crowd workers. DSTC2 contains human-machine interactions collected for the second di-...

  10. [10]

    embeddings and fine-tune during training

  11. [11]

    Universal DA Schema. 5.1. Union of acts based on namespace. In order to align the respective acts in the datasets (GSim and DSTC2), we first took a union of all the acts based on their names to create a unified representation. Figure 1 repre- [Figure 1: Distribution of system acts across datasets] [Table 2: Examples of manual alignment of acts in all datasets.] G...

  12. [12]

    We merge these acts

    Mod1: offer/select - ‘I found a show for 7.30 pm’ / ‘I found shows for 5 pm and 7 pm’. We merge these acts.

  13. [13]

    Mod2: user-request/sys-request - ‘What is the phone number?’ / ‘What kind of food would you like?’ We merge these acts.

  14. [14]

    ‘yes, 7pm’ can become affirm, inform(time=7pm) from affirm(time=7pm)

    Mod3: affirm(x=y)/affirm + inform(x=y) - affirm with slots is equivalent to separate affirm and inform DAs, e.g. ‘yes, 7pm’ can become affirm, inform(time=7pm) from affirm(time=7pm). We split them.

  15. [15]

    We merged/split DAs like the aforementioned ones, as they can easily be restored using other information

    Mod4: reqalts/reqmore - ‘Is there anything else?’ / ‘Can I help you with anything else?’ We merge these acts. We merged/split DAs like the aforementioned ones, as they can easily be restored using other information. For example, if multiple results are offered, we could convert an offer act to a select act, or depending on the agent, we can convert a request a...

  16. [16]

    This version of the dataset only has DAs for the system turns

    DA Tagging of Human-Human Datasets. For experimenting with DA annotation of human-human (HH) dialogues, we used MultiWOZ-2.0 [13] as our dataset. This version of the dataset only has DAs for the system turns. To do an evaluation on MultiWOZ-2.0, we first need to [Footnote 1: Details in Appendix, Table 7] [Table 5: Universal DA schema] ack, affirm, bye, deny, inform, repea...

  17. [17]

    In this work, we investigated multiple annotated human-machine conversation datasets, with differences in DA schema

    Conclusions. We are interested in DA tagging of human-human conversations with the final goal of end-to-end training of task-oriented dialogue systems, so that we can generate system actions for a given dialogue context. In this work, we investigated multiple annotated human-machine conversation datasets, with differences in DA schema. We discussed ma...

  18. [18]

    J. L. Austin, How to do things with words. Oxford University Press, 1975

  19. [19]

    Dialogue act modeling for automatic tagging and recognition of conversational speech,

    A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, R. Martin, C. V. Ess-Dykema, and M. Meteer, “Dialogue act modeling for automatic tagging and recognition of conversational speech,” Computational linguistics, vol. 26, no. 3, pp. 339–373, 2000

  20. [20]

    The hcrc map task corpus,

    A. H. Anderson, M. Bader, E. G. Bard, E. Boyle, G. Doherty, S. Garrod, S. Isard, J. Kowtko, J. McAllister, J. Miller et al., “The hcrc map task corpus,” Language and speech, vol. 34, no. 4, pp. 351–366, 1991

  21. [21]

    Coding dialogs with the damsl annotation scheme,

    M. G. Core and J. Allen, “Coding dialogs with the damsl annotation scheme,” in AAAI fall symposium on communicative action in humans and machines, vol. 56. Boston, MA, 1997

  22. [22]

    The dit++ taxonomy for functional dialogue markup,

    H. Bunt, “The dit++ taxonomy for functional dialogue markup,” in AAMAS 2009 Workshop, Towards a Standard Markup Language for Embodied Dialogue Acts, 2009, pp. 13–24

  23. [23]

    ISO-Standard Domain-Independent Dialogue Act Tagging for Conversational Agents

    S. Mezza, A. Cervone, G. Tortoreto, E. A. Stepanov, and G. Riccardi, “Iso-standard domain-independent dialogue act tagging for conversational agents,” arXiv preprint arXiv:1806.04327, 2018

  24. [24]

    Cued standard dialogue acts,

    S. Young, “Cued standard dialogue acts,” Report, Cambridge University Engineering Department, 14th October, vol. 2007, 2007

  25. [25]

    Towards an iso standard for dialogue act annotation,

    H. Bunt, J. Alexandersson, J. Carletta, J.-W. Choe, A. C. Fang, K. Hasida, K. Lee, V. Petukhova, A. Popescu-Belis, L. Romary et al., “Towards an iso standard for dialogue act annotation,” in Seventh conference on International Language Resources and Evaluation (LREC’10), 2010

  26. [26]

    Bootstrapping a neural conversational agent with dialogue self-play, crowdsourcing and on-line reinforcement learning,

    P. Shah, D. Hakkani-Tür, B. Liu, and G. Tur, “Bootstrapping a neural conversational agent with dialogue self-play, crowdsourcing and on-line reinforcement learning,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL/HLT), vol. 3, 2018, pp. 41–51

  27. [27]

    Back-off action selection in summary space-based pomdp dialogue systems,

    M. Gašić, F. Lefevre, F. Jurčíček, S. Keizer, F. Mairesse, B. Thomson, K. Yu, and S. Young, “Back-off action selection in summary space-based pomdp dialogue systems,” in IEEE Workshop on Automatic Speech Recognition & Understanding. IEEE, 2009, pp. 456–461

  28. [28]

    Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning,

    J. D. Williams, K. Asadi, and G. Zweig, “Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning,” in 55th Annual Meeting of the Association for Computational Linguistics (ACL), 2017

  29. [29]

    Dialogue learning with human teaching and feedback in end-to-end trainable task-oriented dialogue systems,

    B. Liu, G. Tur, D. Hakkani-Tür, P. Shah, and L. Heck, “Dialogue learning with human teaching and feedback in end-to-end trainable task-oriented dialogue systems,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL/HLT), 2018

  30. [30]

    Multiwoz-a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling,

    P. Budzianowski, T.-H. Wen, B.-H. Tseng, I. Casanueva, S. Ultes, O. Ramadan, and M. Gašić, “Multiwoz-a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling,” arXiv preprint arXiv:1810.00278, 2018

  31. [31]

    Edina: Building an open domain socialbot with self-dialogues,

    B. Krause, M. Damonte, M. Dobre, D. Duma, J. Fainberg, F. Fancellu, E. Kahembwe, J. Cheng, and B. Webber, “Edina: Building an open domain socialbot with self-dialogues,” Alexa Prize Proceedings, 2017

  32. [32]

    Switchboard: Telephone speech corpus for research and development,

    J. J. Godfrey, E. C. Holliman, and J. McDaniel, “Switchboard: Telephone speech corpus for research and development,” in icassp. IEEE, 1992, pp. 517–520

  33. [33]

    The second dialog state tracking challenge,

    M. Henderson, B. Thomson, and J. D. Williams, “The second dialog state tracking challenge,” in Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), 2014, pp. 263–272

  34. [34]

    The semantics of dialogue acts,

    H. Bunt, “The semantics of dialogue acts,” in Proceedings of the Ninth International Conference on Computational Semantics. Association for Computational Linguistics, 2011, pp. 1–13

  35. [35]

    Automatic dialog act segmentation and classification in multiparty meetings,

    J. Ang, Y. Liu, and E. Shriberg, “Automatic dialog act segmentation and classification in multiparty meetings,” in Proceedings (ICASSP’05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005, vol. 1. IEEE, 2005, pp. I–1061

  36. [36]

    Joint segmentation and classification of dialog acts using conditional random fields,

    M. Zimmermann, “Joint segmentation and classification of dialog acts using conditional random fields,” in Tenth Annual Conference of the International Speech Communication Association, 2009

  37. [37]

    Dialog act tagging using graphical models,

    G. Ji and J. Bilmes, “Dialog act tagging using graphical models,” in Proceedings (ICASSP’05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005, vol. 1. IEEE, 2005, pp. I–33

  38. [38]

    Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks

    J. Y. Lee and F. Dernoncourt, “Sequential short-text classification with recurrent and convolutional neural networks,” arXiv preprint arXiv:1603.03827, 2016

  39. [39]

    Using context information for dialog act classification in dnn framework,

    Y. Liu, K. Han, Z. Tan, and Y. Lei, “Using context information for dialog act classification in dnn framework,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2170–2178

  40. [40]

    The icsi meeting recorder dialog act (mrda) corpus,

    E. Shriberg, R. Dhillon, S. Bhagat, J. Ang, and H. Carvey, “The icsi meeting recorder dialog act (mrda) corpus,” in Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue at HLT-NAACL 2004, 2004

  41. [41]

    Domain adaptation with unlabeled data for dialog act tagging,

    A. Margolis, K. Livescu, and M. Ostendorf, “Domain adaptation with unlabeled data for dialog act tagging,” in Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing. Association for Computational Linguistics, 2010, pp. 45–52

  42. [42]

    Enriching word vectors with subword information,

    P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.

    A. Appendix. Table 7: Alignment of Datasets with Universal DA Schema. Columns: GSim-R | GSim-M | DSTC2 | Universal DA Schema. First row: inform(x=y) | inform(x=y) | inform(x=y) | inform(x=y). reque...
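The context encoding quoted in entry 7 of the reference graph (many-hot DA vectors concatenated over past turns, then joined with the dialogue encoding and the speaker indicator) can be illustrated with plain Python lists. This is a sketch under stated assumptions: `SCHEMA` holds only the five acts visible in the truncated Table 5 listing above, the function names are hypothetical, and the dialogue encoding `e_i` is passed in as a ready-made list rather than produced by an LSTM.

```python
# Only the acts visible in the truncated Table 5 listing; the full schema
# is longer, so M = 5 here is purely illustrative.
SCHEMA = ["ack", "affirm", "bye", "deny", "inform"]


def many_hot(acts: set[str], schema: list[str]) -> list[int]:
    """Many-hot vector d_i of dimension M: 1 for each act present in the turn."""
    return [1 if act in acts else 0 for act in schema]


def past_da_encoding(history: list[set[str]], schema: list[str]) -> list[int]:
    """p_i: concatenation of the many-hot vectors of all previous turns."""
    encoded: list[int] = []
    for acts in history:
        encoded.extend(many_hot(acts, schema))
    return encoded


def context_vector(e_i: list[float], g_i: int,
                   history: list[set[str]], schema: list[str]) -> list[float]:
    """C_i: concatenation of e_i, the speaker indicator g_i, and p_i."""
    return list(e_i) + [g_i] + past_da_encoding(history, schema)


print(many_hot({"affirm", "inform"}, SCHEMA))               # [0, 1, 0, 0, 1]
print(past_da_encoding([{"inform"}, {"ack", "bye"}], SCHEMA))
```

Note that `past_da_encoding` grows with the number of past turns. How the paper bounds or pads this history is not visible in the truncated snippet, so the sketch leaves it unbounded.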