Towards Universal Dialogue Act Tagging for Task-Oriented Dialogues
Pith reviewed 2026-05-25 02:00 UTC · model grok-4.3
The pith
A universal dialogue act schema aligns existing datasets so a single tagger can label human-human task-oriented conversations at 54% F1 without new annotations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors define a Universal DA schema for task-oriented dialogues, align multiple existing annotated datasets to it through a combination of manual and automated methods, and train a Universal DA tagger (U-DAT) on the resulting pooled data. Applied to human-human dialogues, the tagger obtains 54.1% F1 on system turns in a fully unsupervised setting and 57.7% F1 in a semi-supervised setting, a score that would otherwise require at least 1.7K manually annotated turns. Performance improves further when unlabeled or labeled target-domain data is supplied.
What carries the argument
The Universal DA schema together with the manual-plus-automated alignment procedure that converts labels from prior datasets into the common schema.
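The alignment step can be pictured as a small label-rewriting function. The sketch below is illustrative only and is not the authors' code: it encodes the four modifications quoted in the extracted paper text (Mod1 to Mod4), while the choice of which act name survives each merge, the `MERGE` table, and the function names `align_act` and `align_turn` are assumptions made for this example.

```python
from __future__ import annotations

import re

# Hypothetical rename table: source act name -> universal act name.
# The merges follow Mod1, Mod2 and Mod4 from the paper text; which name
# survives each merge is an assumption made for this sketch.
MERGE = {
    "select": "offer",          # Mod1: offer/select are merged
    "user-request": "request",  # Mod2: user-request/sys-request are merged
    "sys-request": "request",
    "reqmore": "reqalts",       # Mod4: reqalts/reqmore are merged
}

# An act is a name with an optional argument list, e.g. "affirm(time=7pm)".
ACT_PATTERN = re.compile(r"^(?P<name>[\w-]+)(?:\((?P<args>.*)\))?$")


def align_act(act: str) -> list[str]:
    """Map one source act string to a list of universal act strings."""
    match = ACT_PATTERN.match(act.strip())
    if match is None:
        raise ValueError(f"unparseable act: {act!r}")
    name = MERGE.get(match["name"], match["name"])
    args = match["args"]
    if name == "affirm" and args:
        # Mod3: affirm with slots becomes a bare affirm plus an inform.
        return ["affirm", f"inform({args})"]
    return [f"{name}({args})" if args else name]


def align_turn(acts: list[str]) -> list[str]:
    """Align every act in a turn, dropping duplicates but keeping order."""
    aligned: list[str] = []
    for act in acts:
        for universal in align_act(act):
            if universal not in aligned:
                aligned.append(universal)
    return aligned
```

For instance, `align_act("affirm(time=7pm)")` yields `["affirm", "inform(time=7pm)"]`, mirroring the paper's "yes, 7pm" example. A real alignment would also need the manual, dataset-specific decisions that this sketch does not attempt to model.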
If this is right
- Labeled task-oriented datasets can be reused across projects instead of being discarded when schemas differ.
- New domains or customer-care logs become usable for training with far fewer than 1.7K new annotations.
- Performance on target human-human data rises when even modest amounts of unlabeled or labeled target data are added.
- The same aligned resource supports both human-machine and human-human tagging tasks.
Where Pith is reading between the lines
- The alignment technique could be applied to other dialogue annotations such as slot values or user intents.
- If the universal schema proves stable, the tagger might serve as a starting point for open-domain dialogue labeling.
- Real-world customer logs could be run through the tagger to measure how often its output matches downstream system actions.
Load-bearing premise
Existing dialogue-act schemas can be mapped onto the proposed universal schema without introducing enough label noise or systematic bias to undermine training and evaluation on human-human data.
What would settle it
Collect a fresh sample of human-human turns, have experts label them directly with the universal schema, and check whether the tagger's F1 on those turns falls substantially below the reported 54.1% unsupervised figure.
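A minimal sketch of the scoring step in such a check follows. Because a turn can carry several DAs, each turn is treated as a set of labels. The source does not say how the reported F1 is averaged, so micro-averaging is an assumption here, and `micro_f1` is a hypothetical helper name.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over per-turn sets of dialogue acts.

    gold and predicted are equal-length sequences; each element is an
    iterable of act labels for one turn.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = fn = 0
    for gold_acts, pred_acts in zip(gold, predicted):
        gold_set, pred_set = set(gold_acts), set(pred_acts)
        tp += len(gold_set & pred_set)   # acts predicted and correct
        fp += len(pred_set - gold_set)   # acts predicted but not in gold
        fn += len(gold_set - pred_set)   # gold acts that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Running this on expert labels and tagger output for the fresh sample gives a number that can be set against the 54.1% figure, provided the averaging convention matches the one used in the paper.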
Original abstract
Machine learning approaches for building task-oriented dialogue systems require large conversational datasets with labels to train on. We are interested in building task-oriented dialogue systems from human-human conversations, which may be available in ample amounts in existing customer care center logs or can be collected from crowd workers. Annotating these datasets can be prohibitively expensive. Recently multiple annotated task-oriented human-machine dialogue datasets have been released, however their annotation schema varies across different collections, even for well-defined categories such as dialogue acts (DAs). We propose a Universal DA schema for task-oriented dialogues and align existing annotated datasets with our schema. Our aim is to train a Universal DA tagger (U-DAT) for task-oriented dialogues and use it for tagging human-human conversations. We investigate multiple datasets, propose manual and automated approaches for aligning the different schema, and present results on a target corpus of human-human dialogues. In unsupervised learning experiments we achieve an F1 score of 54.1% on system turns in human-human dialogues. In a semi-supervised setup, the F1 score increases to 57.7% which would otherwise require at least 1.7K manually annotated turns. For new domains, we show further improvements when unlabeled or labeled target domain data is available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a Universal Dialogue Act (DA) schema for task-oriented dialogues, aligns multiple existing annotated human-machine dialogue datasets to this schema via manual and automated methods, trains a Universal DA Tagger (U-DAT), and evaluates it on a target corpus of human-human dialogues. It reports an unsupervised F1 of 54.1% on system turns that rises to 57.7% in a semi-supervised setting, claiming that this score would otherwise require at least 1.7K manually annotated turns.
Significance. If the alignment step produces labels free of substantial noise, the work would demonstrate a practical route to bootstrap DA taggers for human-human data from cheaper existing human-machine corpora, directly addressing the annotation bottleneck for task-oriented dialogue systems.
major comments (1)
- [Alignment methods section] The section describing the manual and automated schema alignment provides no validation metrics (inter-annotator agreement, held-out accuracy against gold universal labels, or error analysis of collapsed acts). Because the headline unsupervised (54.1%) and semi-supervised (57.7%) F1 scores on the human-human target corpus are obtained by training on the aligned labels, the absence of any quality check on the alignment is load-bearing for the central claim.
minor comments (1)
- [Abstract and Experiments] The abstract states headline F1 numbers without reference to model architecture, validation splits, or confidence intervals; these details should appear in the main experimental section for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the single major comment below.
- Referee: [Alignment methods section] The section describing the manual and automated schema alignment provides no validation metrics (inter-annotator agreement, held-out accuracy against gold universal labels, or error analysis of collapsed acts). Because the headline unsupervised (54.1%) and semi-supervised (57.7%) F1 scores on the human-human target corpus are obtained by training on the aligned labels, the absence of any quality check on the alignment is load-bearing for the central claim.
Authors: We agree that explicit validation metrics for the alignment are necessary to support the central claim. In the revised manuscript we will report inter-annotator agreement on a sampled subset of the manual alignments, held-out accuracy of the automated alignment against a small set of gold universal labels, and a qualitative error analysis of the collapsed acts. These additions will be placed in the alignment section and will be independent of the downstream tagging results. revision: yes
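One way the promised inter-annotator agreement could be computed is sketched below. The rebuttal does not name a statistic, so the use of Cohen's kappa for two annotators who each assign a single universal act per item is an assumption, and `cohen_kappa` is a hypothetical helper written for this example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)
```

Multi-label turns would need a different treatment, for example a per-act kappa or a set-based agreement measure.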
Circularity Check
No circularity: performance metrics are independent empirical outcomes
The paper defines a universal DA schema, describes manual/automated alignment of prior datasets to it, trains U-DAT on the aligned data, and reports F1 scores (54.1% unsupervised, 57.7% semi-supervised) on a separate target human-human corpus. None of these steps reduce by construction to the reported numbers; the F1 values are downstream results of training and evaluation rather than re-statements of fitted inputs, self-citations, or renamed patterns. The alignment step is a preprocessing choice whose correctness is an external assumption, not a definitional loop inside the derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Existing dialogue act annotation schemas from multiple datasets can be aligned to a single universal schema without substantial information loss or bias.
invented entities (1)
- Universal DA schema: no independent evidence
Reference graph
Works this paper leans on
-
[1]
Introduction. Dialogue acts (DAs) aim to portray the meaning of utterances at the level of illocutionary force, capturing a speaker’s intention in producing that utterance [1]. DAs have been investigated by dialogue researchers for many years [2] and multiple taxonomies have been proposed [3, 4, 5] (see [6] for a review). Recent work in task-oriented...
-
[2]
Towards Universal Dialogue Act Tagging for Task-Oriented Dialogues
Related Work. The mismatch between multiple DA taxonomies has been identified by [6] previously, where a subset of ISO 24617-2 (the international ISO standard for DA annotation) tags [17] have been identified and annotations of multiple corpora were mapped to this set, focusing on social conversations. Our work has a similar goal, but focuses on DAs th...
-
[3]
DA Tagging for Dialogue Systems. Let a dialogue $D$ with $N$ turns be denoted as a series of user and system utterances $u_i$, i.e. $D = u_1, u_2, \ldots, u_N$, and let $A$ be the predefined set of $M$ DAs, i.e. $A = a_1, a_2, \ldots, a_M$. Given an utterance $u_i$ and its conversation history, DA tagging aims to predict the set of DAs $A_i \subset A$ of $u_i$. We use a deep neural network based model fo...
-
[4]
A bi-directional LSTM to encode each utterance, $u_i = w^i_1, \ldots, w^i_{n_i}$, where $n_i$ denotes the number of tokens in $u_i$. The final utterance representation $z_i$ is obtained by concatenating the last hidden layer of the forward LSTM, $\overrightarrow{\mathrm{LSTM}}$, and the first hidden layer of the backward LSTM, $\overleftarrow{\mathrm{LSTM}}$: $z_i = \overleftarrow{\mathrm{LSTM}}(u_i) \oplus$ ...
-
[5]
A hierarchical, uni-directional LSTM to encode the dialogue-level information $e_i$: $e_i = \mathrm{LSTM}(z_1, \ldots, z_{i-1})$
-
[6]
An indicator number $g_i$ representing whether the agent is the user or the system, i.e. $g_i = 0$ if $u_i^{\mathrm{agent}} = \mathrm{user}$, and $g_i = 1$ otherwise
-
[7]
Encoding over past DA(s) $p_i$, where the final representation is obtained by concatenating the many-hot representations of past DAs. A DA vector is represented as a many-hot vector $d_i$ of dimension $M$, where we mark the true DAs as 1: $p_i = d_1 \oplus d_2 \oplus \ldots \oplus d_{i-1}$. The final encoded context $C_i$ is given by $C_i = e_i \oplus g_i \oplus p_i$ (2). $C_i$ is then fed into a feed forward ...
-
[8]
Datasets and Experiments. Our aim is to train a Universal DA tagger using public datasets, but the label spaces across these datasets are not aligned. Therefore, we need a unified representation of all the acts present across the datasets. We obtain this representation by manually going through the datasets and aligning semantically similar sentences ...
-
[9]
for DAs. The GSim data has two parts and was collected by generating dialogue flows for movie (GSim-M) and restaurant (GSim-R) booking domains, where the individual turns from simulation in terms of DAs and associated arguments were then converted to natural language by crowd workers. DSTC2 contains human-machine interactions collected for the second di-...
-
[10]
embeddings and fine-tune during training
-
[11]
Universal DA Schema. 5.1. Union of acts based on namespace. In order to align the respective acts in the datasets (GSim and DSTC2), we first took a union of all the acts based on their names to create a unified representation. Figure 1 repre-
Figure 1: Distribution of system acts across datasets
Table 2: Examples of manual alignment of acts in all datasets. G...
-
[12]
Mod1: offer/select - ‘I found a show for 7.30 pm’ / ‘I found shows for 5 pm and 7 pm’. We merge these acts.
-
[13]
Mod2: user-request/sys-request - ‘What is the phone number?’ / ‘What kind of food would you like?’ We merge these acts.
-
[14]
Mod3: affirm(x=y)/affirm + inform(x=y) - affirm with slots is equivalent to separate affirm and inform DAs, e.g. ‘yes, 7pm’ can become affirm, inform(time=7pm) from affirm(time=7pm). We split them.
-
[15]
Mod4: reqalts/reqmore - ‘Is there anything else?’ / ‘Can I help you with anything else?’ We merge these acts. We merged/split DAs like the aforementioned ones, as they can easily be restored using other information. For example, if multiple results are offered, we could convert an offer act to a select act, or depending on the agent, we can convert a request a...
-
[16]
DA Tagging of Human-Human Datasets. For experimenting with DA annotation of human-human (HH) dialogues, we used MultiWOZ-2.0 [13] as our dataset. This version of the dataset only has DAs for the system turns. To do an evaluation on MultiWOZ-2.0, we first need to ...
Footnote 1: Details in Appendix, Table 7
Table 5: Universal DA schema: ack, affirm, bye, deny, inform, repea...
-
[17]
Conclusions. We are interested in DA tagging of human-human conversations with the final goal of end-to-end training of task-oriented dialogue systems, so that we can generate system actions for a given dialogue context. In this work, we investigated multiple annotated human-machine conversation datasets, with differences in DA schema. We discussed ma...
-
[18]
J. L. Austin, How to Do Things with Words. Oxford University Press, 1975.
-
[19]
A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, R. Martin, C. V. Ess-Dykema, and M. Meteer, “Dialogue act modeling for automatic tagging and recognition of conversational speech,” Computational Linguistics, vol. 26, no. 3, pp. 339–373, 2000.
-
[20]
A. H. Anderson, M. Bader, E. G. Bard, E. Boyle, G. Doherty, S. Garrod, S. Isard, J. Kowtko, J. McAllister, J. Miller et al., “The HCRC map task corpus,” Language and Speech, vol. 34, no. 4, pp. 351–366, 1991.
-
[21]
M. G. Core and J. Allen, “Coding dialogs with the DAMSL annotation scheme,” in AAAI Fall Symposium on Communicative Action in Humans and Machines, vol. 56. Boston, MA, 1997.
-
[22]
H. Bunt, “The DIT++ taxonomy for functional dialogue markup,” in AAMAS 2009 Workshop, Towards a Standard Markup Language for Embodied Dialogue Acts, 2009, pp. 13–24.
-
[23]
S. Mezza, A. Cervone, G. Tortoreto, E. A. Stepanov, and G. Riccardi, “ISO-standard domain-independent dialogue act tagging for conversational agents,” arXiv preprint arXiv:1806.04327, 2018.
-
[24]
S. Young, “CUED standard dialogue acts,” Report, Cambridge University Engineering Department, 14th October, vol. 2007, 2007.
-
[25]
H. Bunt, J. Alexandersson, J. Carletta, J.-W. Choe, A. C. Fang, K. Hasida, K. Lee, V. Petukhova, A. Popescu-Belis, L. Romary et al., “Towards an ISO standard for dialogue act annotation,” in Seventh Conference on International Language Resources and Evaluation (LREC’10), 2010.
-
[26]
P. Shah, D. Hakkani-Tür, B. Liu, and G. Tur, “Bootstrapping a neural conversational agent with dialogue self-play, crowdsourcing and on-line reinforcement learning,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL/HLT), vol. 3, 2018, pp. 41–51.
-
[27]
M. Gašić, F. Lefevre, F. Jurčíček, S. Keizer, F. Mairesse, B. Thomson, K. Yu, and S. Young, “Back-off action selection in summary space-based POMDP dialogue systems,” in IEEE Workshop on Automatic Speech Recognition & Understanding. IEEE, 2009, pp. 456–461.
-
[28]
J. D. Williams, K. Asadi, and G. Zweig, “Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning,” in 55th Annual Meeting of the Association for Computational Linguistics (ACL), 2017.
-
[29]
B. Liu, G. Tur, D. Hakkani-Tür, P. Shah, and L. Heck, “Dialogue learning with human teaching and feedback in end-to-end trainable task-oriented dialogue systems,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL/HLT), 2018.
-
[30]
P. Budzianowski, T.-H. Wen, B.-H. Tseng, I. Casanueva, S. Ultes, O. Ramadan, and M. Gašić, “MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling,” arXiv preprint arXiv:1810.00278, 2018.
-
[31]
B. Krause, M. Damonte, M. Dobre, D. Duma, J. Fainberg, F. Fancellu, E. Kahembwe, J. Cheng, and B. Webber, “Edina: Building an open domain socialbot with self-dialogues,” Alexa Prize Proceedings, 2017.
-
[32]
J. J. Godfrey, E. C. Holliman, and J. McDaniel, “Switchboard: Telephone speech corpus for research and development,” in ICASSP. IEEE, 1992, pp. 517–520.
-
[33]
M. Henderson, B. Thomson, and J. D. Williams, “The second dialog state tracking challenge,” in Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), 2014, pp. 263–272.
-
[34]
H. Bunt, “The semantics of dialogue acts,” in Proceedings of the Ninth International Conference on Computational Semantics. Association for Computational Linguistics, 2011, pp. 1–13.
-
[35]
J. Ang, Y. Liu, and E. Shriberg, “Automatic dialog act segmentation and classification in multiparty meetings,” in Proceedings (ICASSP’05), IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005, vol. 1. IEEE, 2005, pp. I–1061.
-
[36]
M. Zimmermann, “Joint segmentation and classification of dialog acts using conditional random fields,” in Tenth Annual Conference of the International Speech Communication Association, 2009.
-
[37]
G. Ji and J. Bilmes, “Dialog act tagging using graphical models,” in Proceedings (ICASSP’05), IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005, vol. 1. IEEE, 2005, pp. I–33.
-
[38]
J. Y. Lee and F. Dernoncourt, “Sequential short-text classification with recurrent and convolutional neural networks,” arXiv preprint arXiv:1603.03827, 2016.
-
[39]
Y. Liu, K. Han, Z. Tan, and Y. Lei, “Using context information for dialog act classification in DNN framework,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2170–2178.
-
[40]
E. Shriberg, R. Dhillon, S. Bhagat, J. Ang, and H. Carvey, “The ICSI meeting recorder dialog act (MRDA) corpus,” in Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue at HLT-NAACL 2004, 2004.
-
[41]
A. Margolis, K. Livescu, and M. Ostendorf, “Domain adaptation with unlabeled data for dialog act tagging,” in Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing. Association for Computational Linguistics, 2010, pp. 45–52.
-
[42]
P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
A. Appendix
Table 7: Alignment of Datasets with Universal DA Schema
GSim-R | GSim-M | DSTC2 | Universal DA Schema
inform(x=y) | inform(x=y) | inform(x=y) | inform(x=y)
reque...