arxiv: 1907.03070 · v1 · pith:GVLN4OWOnew · submitted 2019-07-06 · 💻 cs.CL

Short Text Conversation Based on Deep Neural Network and Analysis on Evaluation Measures

Pith reviewed 2026-05-25 01:58 UTC · model grok-4.3

classification 💻 cs.CL
keywords short text conversation · dialogue quality · nugget detection · deep neural network · BERT · evaluation measures · chatbot evaluation

The pith

Hierarchical neural networks using BERT outperform prior models on automatic evaluation of chatbot dialogues for quality and nugget detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops deep neural network models to automate evaluation of short text conversations produced by chatbots, targeting the Dialogue Quality and Nugget Detection subtasks to reduce dependence on costly human annotation. Models are built with a hierarchical structure of embedding, utterance, context, and memory layers that progressively learn representations from word level to long-range context, with gating and attention added at intermediate layers. Substituting BERT for the embedding and utterance layers produces stronger sentence representations than multi-stack CNN, resulting in better performance than other models on both subtasks. The work additionally runs experiments with traditional metrics such as accuracy and F1 to compare against the specialized measures NMD, RSNOD, JSD, and RNSS. A reader would care because scalable automatic scoring could support faster iteration on customer-service chatbots.

Core claim

The authors establish that hierarchical models built from embedding, utterance, context, and memory layers with gating and attention solve the DQ and ND subtasks, and that BERT yields better utterance representations than multi-stack CNN, outperforming other proposed models. They further show that the specialized measures NMD, RSNOD for DQ and JSD, RNSS for ND reveal performance patterns distinct from those seen under accuracy, precision, recall, and F1-score.
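The page names the DQ measures without defining them. As a rough illustration, the sketch below assumes the definitions usually given for comparing binned probability distributions: NMD as the sum of absolute cumulative differences divided by the number of bins minus one, and RSNOD as the root of a symmetric, distance-weighted squared error. The function names and the five-bin toy distributions are hypothetical, and the exact normalisation should be checked against the task definition before relying on it.

```python
import math

def nmd(pred, gold):
    """Normalised Match Distance (assumed form): sum |cumulative diff| / (L - 1)."""
    cum_pred = cum_gold = total = 0.0
    for p, g in zip(pred, gold):
        cum_pred += p
        cum_gold += g
        total += abs(cum_pred - cum_gold)
    return total / (len(gold) - 1)

def _order_aware_div(p, q):
    """Distance-weighted squared error, averaged over the bins where q > 0."""
    bins = [i for i, qi in enumerate(q) if qi > 0]
    weighted = [
        sum(abs(i - j) * (p[j] - q[j]) ** 2 for j in range(len(q)))
        for i in bins
    ]
    return sum(weighted) / len(bins)

def rsnod(pred, gold):
    """Root Symmetric Normalised Order-aware Divergence (assumed form)."""
    sod = (_order_aware_div(pred, gold) + _order_aware_div(gold, pred)) / 2
    return math.sqrt(sod / (len(gold) - 1))

# Hypothetical gold distribution over five ordered quality bins.
gold = [0.0, 0.0, 1.0, 0.0, 0.0]
near_miss = [0.0, 1.0, 0.0, 0.0, 0.0]   # one bin away from gold
far_miss = [1.0, 0.0, 0.0, 0.0, 0.0]    # two bins away from gold

for name, pred in [("near", near_miss), ("far", far_miss)]:
    print(name, round(nmd(pred, gold), 4), round(rsnod(pred, gold), 4))
```

Both measures penalise the far miss more than the near miss, which is the ordinal sensitivity that plain accuracy lacks.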

What carries the argument

Hierarchical dialogue representation built from embedding layer, utterance layer, context layer, and memory layer, with gating and attention at utterance and context layers; BERT used as replacement for embedding and utterance layers.
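To make the layering concrete, here is a minimal sketch of the word-to-utterance-to-context-to-memory flow. It is not the authors' model: mean pooling stands in for the multi-stack CNN or BERT utterance encoder, neighbour averaging stands in for the context layer, and a plain dot-product attention stands in for the memory read. The `toy_embed` function and all other names are hypothetical.

```python
import math

def toy_embed(word, dim=4):
    """Hypothetical deterministic embedding, used only to get numbers to flow."""
    seed = sum(ord(ch) for ch in word)
    return [((seed * (d + 1)) % 7) / 7.0 for d in range(dim)]

def mean_pool(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, memory):
    """Dot-product attention: returns the weighted sum of memory and the weights."""
    weights = softmax([sum(q * m for q, m in zip(query, mem)) for mem in memory])
    dim = len(query)
    summary = [sum(w * mem[d] for w, mem in zip(weights, memory)) for d in range(dim)]
    return summary, weights

def encode_dialogue(dialogue, embed):
    # Embedding + utterance layers: one vector per utterance.
    utterances = [mean_pool([embed(w) for w in utt]) for utt in dialogue]
    # Context layer: mix each utterance with its immediate neighbours.
    contexts = [
        mean_pool(utterances[max(0, i - 1): i + 2])
        for i in range(len(utterances))
    ]
    # Memory layer: every position reads from all context vectors.
    return [attend(c, contexts)[0] for c in contexts]

dialogue = [["my", "phone", "is", "broken"], ["sorry", "to", "hear", "that"], ["thanks"]]
for vector in encode_dialogue(dialogue, toy_embed):
    print([round(x, 3) for x in vector])
```

Swapping the utterance encoder while leaving the later stages untouched mirrors the substitution the paper describes, where BERT replaces the embedding and utterance layers.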

If this is right

  • Automatic scoring of chatbot dialogues becomes practical without large-scale human annotation.
  • BERT supplies stronger utterance-level features than multi-stack CNN for both subtasks.
  • Specialized measures expose aspects of model behavior not captured by accuracy or F1.
  • The memory layer supports modeling of longer conversational context.
  • The same architecture can be applied to other short-text dialogue evaluation problems.
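The point about specialized measures can be illustrated with a toy case: two predicted distributions that pick the same top label, and so tie under accuracy, can sit at very different distances from the annotators' distribution. The sketch assumes JSD is computed with base-2 logarithms and RNSS is the square root of half the summed squared differences; the three-label distributions and function names are hypothetical.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence with base-2 logs (assumed form), in [0, 1]."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return (kl(p, mid) + kl(q, mid)) / 2

def rnss(p, q):
    """Root Normalised Sum of Squares (assumed form)."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)) / 2)

def argmax(dist):
    return max(range(len(dist)), key=dist.__getitem__)

# Hypothetical annotator distribution over three nugget labels.
gold = [0.6, 0.3, 0.1]
close = [0.55, 0.35, 0.10]    # similar shape to gold
peaked = [0.98, 0.01, 0.01]   # same top label, overconfident

for name, pred in [("close", close), ("peaked", peaked)]:
    hit = argmax(pred) == argmax(gold)
    print(name, "top-label hit:", hit,
          "JSD:", round(jsd(pred, gold), 4),
          "RNSS:", round(rnss(pred, gold), 4))
```

Both predictions count as correct under a top-label accuracy, while the distribution-based measures separate them.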

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pre-trained encoders like BERT may become standard building blocks for dialogue evaluation systems.
  • The measure comparison could motivate hybrid scoring protocols that combine traditional and specialized metrics.
  • The layered approach might transfer to multi-turn or multi-party conversations outside customer service.

Load-bearing premise

The hierarchical layers with gating and attention capture enough dialogue context to produce measurable gains over baselines on the DQ and ND subtasks.
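The source does not give the gating equations, so the following is only a generic sketch of what a gate at the utterance or context layer might do: a sigmoid decides, per dimension, how much of the utterance vector to keep versus how much of the surrounding context to let in. The per-dimension weight pairs, the `bias` parameter, and the function names are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_mix(utterance, context, weights, bias=0.0):
    """Blend two vectors: a gate near 1 keeps the utterance, near 0 keeps the context."""
    mixed = []
    for u, c, (w_u, w_c) in zip(utterance, context, weights):
        gate = sigmoid(w_u * u + w_c * c + bias)
        mixed.append(gate * u + (1.0 - gate) * c)
    return mixed

utterance = [1.0, 0.0]
context = [0.0, 1.0]
neutral = [(0.0, 0.0), (0.0, 0.0)]  # hypothetical untrained gate weights

print(gated_mix(utterance, context, neutral))             # even blend
print(gated_mix(utterance, context, neutral, bias=50.0))  # gate open: utterance dominates
print(gated_mix(utterance, context, neutral, bias=-50.0)) # gate closed: context dominates
```

In a trained model the weights would be learned, so the gate could pass context for some utterances and suppress it for others.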

What would settle it

The claim would be undercut if, on a new test set, the BERT version of the model showed no improvement over the multi-stack CNN baseline when scored with NMD on the DQ task or JSD on the ND task.

Original abstract

With the development of Natural Language Processing, Automatic question-answering system such as Waston, Siri, Alexa, has become one of the most important NLP applications. Nowadays, enterprises try to build automatic custom service chatbots to save human resources and provide a 24-hour customer service. Evaluation of chatbots currently relied greatly on human annotation which cost a plenty of time. Thus, has initiated a new Short Text Conversation subtask called Dialogue Quality (DQ) and Nugget Detection (ND) which aim to automatically evaluate dialogues generated by chatbots. In this paper, we solve the DQ and ND subtasks by deep neural network. We proposed two models for both DQ and ND subtasks which is constructed by hierarchical structure: embedding layer, utterance layer, context layer and memory layer, to hierarchical learn dialogue representation from word level, sentence level, context level to long range context level. Furthermore, we apply gating and attention mechanism at utterance layer and context layer to improve the performance. We also tried BERT to replace embedding layer and utterance layer as sentence representation. The result shows that BERT produced a better utterance representation than multi-stack CNN for both DQ and ND subtasks and outperform other models proposed by other researches. The evaluation measures are proposed by , that is, NMD, RSNOD for DQ and JSD, RNSS for ND, which is not traditional evaluation measures such as accuracy, precision, recall and f1-score. Thus, we have done a series of experiments by using traditional evaluation measures and analyze the performance and error.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes two hierarchical deep neural network models for the Dialogue Quality (DQ) and Nugget Detection (ND) subtasks in automatic chatbot dialogue evaluation. The architecture consists of embedding, utterance, context, and memory layers, with gating and attention mechanisms applied at the utterance and context levels. The authors replace the embedding/utterance layers with BERT (versus multi-stack CNN) and claim that BERT yields superior utterance representations, leading to better performance on the custom metrics NMD/RSNOD (DQ) and JSD/RNSS (ND) while also outperforming prior models; they further analyze results using traditional metrics such as accuracy, precision, recall, and F1-score.

Significance. If the empirical claims are substantiated with complete, reproducible results, the work could contribute to reducing reliance on human annotation for chatbot evaluation by advancing automatic metrics and hierarchical dialogue representations. The explicit comparison of custom metrics against traditional ones and the exploration of BERT integration are potentially useful for the field, though the absence of any reported scores, ablations, or dataset details in the provided text prevents assessment of whether these contributions are realized.

major comments (3)
  1. [Abstract] Abstract: the central claim that 'BERT produced a better utterance representation than multi-stack CNN for both DQ and ND subtasks and outperform other models proposed by other researches' is unsupported by any numerical results, deltas, significance tests, or ablation studies, rendering the outperformance assertion unverifiable from the manuscript text.
  2. [Abstract] Abstract: no information is supplied on the datasets, baseline reimplementations, hyperparameter settings, or error bars for the DQ/ND experiments, which are load-bearing for validating that gains are attributable to the BERT substitution rather than other factors in the hierarchical model.
  3. [Abstract] Abstract: the description of the hierarchical structure (embedding + utterance + context + memory with gating/attention) and the switch to BERT lacks any equations, layer dimensions, or implementation specifics, preventing evaluation of whether the architecture is preserved across the CNN and BERT variants.
minor comments (2)
  1. [Abstract] Abstract contains grammatical issues, e.g., 'Thus, has initiated a new Short Text Conversation subtask' (incomplete sentence) and 'Waston' (should be Watson).
  2. [Abstract] The phrase 'the result shows that' is used without preceding any actual results or tables, which is confusing for readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed feedback on our abstract. We agree that additional details will strengthen the presentation and will revise the abstract accordingly to include key results, experimental information, and architectural specifics.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'BERT produced a better utterance representation than multi-stack CNN for both DQ and ND subtasks and outperform other models proposed by other researches' is unsupported by any numerical results, deltas, significance tests, or ablation studies, rendering the outperformance assertion unverifiable from the manuscript text.

    Authors: The experimental section of the manuscript reports the relevant performance numbers, comparisons, and analyses supporting the claim. To ensure the abstract is self-contained and verifiable on its own, we will revise it to incorporate key numerical results, deltas, and references to the ablation studies and significance testing.
    revision: yes

  2. Referee: [Abstract] Abstract: no information is supplied on the datasets, baseline reimplementations, hyperparameter settings, or error bars for the DQ/ND experiments, which are load-bearing for validating that gains are attributable to the BERT substitution rather than other factors in the hierarchical model.

    Authors: Dataset descriptions, baseline details, and experimental settings appear in the methods and experiments sections. We will revise the abstract to briefly note the dataset(s), the baselines used, and that hyperparameters and error bars are reported in the full experimental results.
    revision: yes

  3. Referee: [Abstract] Abstract: the description of the hierarchical structure (embedding + utterance + context + memory with gating/attention) and the switch to BERT lacks any equations, layer dimensions, or implementation specifics, preventing evaluation of whether the architecture is preserved across the CNN and BERT variants.

    Authors: The model architecture, including equations and dimensions, is detailed in the methods section. We will revise the abstract to provide a concise summary of the hierarchical layers, gating/attention mechanisms, and how the BERT substitution is applied while preserving the overall structure, with explicit reference to the full specifications in the paper.
    revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical model comparison with no derivations or self-referential predictions

The paper proposes hierarchical neural architectures (embedding/utterance/context/memory layers with gating/attention) and reports experimental results comparing BERT vs. multi-stack CNN on DQ/ND subtasks using custom metrics (NMD/RSNOD, JSD/RNSS) plus traditional ones. No mathematical derivation chain exists; all claims rest on trained model performance rather than equations that reduce to inputs by construction, fitted parameters called predictions, or load-bearing self-citations. The central outperformance claim is an empirical assertion, not a derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit axioms, free parameters, or invented entities are stated in the abstract; the work is an empirical application of existing neural architectures.
