Is It Worth the Attention? A Comparative Evaluation of Attention Layers for Argument Unit Segmentation

Hendrik Heuer; Jonas Klaff; Maximilian Spliethöver

arxiv: 1906.10068 · v1 · pith:IB75RUHXnew · submitted 2019-06-24 · 💻 cs.CL · cs.LG


Pith reviewed 2026-05-25 17:19 UTC · model grok-4.3


keywords argumentation mining · argument unit segmentation · attention mechanisms · bidirectional LSTM · contextualized embeddings · comparative evaluation · sequence labeling

The pith

Adding attention layers to bidirectional LSTMs does not improve argument unit segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts a comparative evaluation of attention mechanisms added to bidirectional long short-term memory networks for the task of segmenting argumentative units in text. It also tests sentence-level contextualized word embeddings against pre-generated embeddings as input. The results indicate that the attention layer brings no performance gain over the baseline BiLSTM model. Contextualized embeddings likewise fail to improve scores in most evaluated cases. This suggests that for argument unit segmentation, added complexity from attention is not justified by the outcomes.
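Argument unit segmentation is usually framed as sequence labeling: each token receives a label, and contiguous labeled tokens form a unit. The following is a minimal sketch of that framing, assuming a plain B/I/O label scheme (begin, inside, outside). The label names, the `bio_to_spans` helper, and the example sentence are illustrative assumptions and are not taken from the paper.

```python
from typing import List, Tuple


def bio_to_spans(labels: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) token spans, end exclusive, for a B/I/O label sequence."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, label in enumerate(labels):
        if label == "B":
            # A new unit begins; close the one that was open, if any.
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == "I":
            # Tolerate an "I" without a preceding "B" by opening a unit here.
            if start is None:
                start = i
        else:
            # Any other label ("O") closes the open unit.
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


# Hypothetical example sentence and labels, for illustration only.
tokens = "We should ban cars because they pollute the air .".split()
labels = ["B", "I", "I", "I", "O", "B", "I", "I", "I", "O"]

spans = bio_to_spans(labels)
units = [" ".join(tokens[s:e]) for s, e in spans]
```

A segmentation model, with or without attention, only has to predict the label sequence; the spans follow deterministically from it.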

Core claim

For the task of argument unit segmentation, incorporating an additional attention layer into a bidirectional LSTM does not yield better performance than the baseline model alone, and contextualized embeddings do not consistently outperform pre-generated embeddings.

What carries the argument

Bidirectional long short-term memory network as the base model for argument unit segmentation, tested with and without added attention layers and with different embedding inputs.
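To make concrete what an added attention layer computes, here is a minimal sketch of scaled dot-product self-attention applied to a sequence of hidden-state vectors, such as the per-token outputs of a BiLSTM. It assumes no learned projections and uses the states themselves as queries, keys, and values. This is a generic illustration, not the authors' implementation: the paper's reference list points to the keras-self-attention and keras-multi-head packages, whose exact configuration is not reproduced here.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def self_attention(states: List[Vector]) -> List[Vector]:
    """Replace each state by an attention-weighted average of all states.

    Assumption: queries, keys, and values are the raw states (no learned weights).
    """
    dim = len(states[0])
    scale = math.sqrt(dim)
    output: List[Vector] = []
    for query in states:
        weights = softmax([dot(query, key) / scale for key in states])
        context = [
            sum(w * value[d] for w, value in zip(weights, states))
            for d in range(dim)
        ]
        output.append(context)
    return output


# Hypothetical hidden states for a three-token sequence.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(hidden_states)
```

Each output vector is a convex combination of the inputs, so the layer adds cross-token mixing on top of what the recurrent layer already provides. The paper's finding is that this extra mixing did not raise segmentation scores.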

If this is right

Simpler bidirectional LSTM models without attention can match or exceed more complex variants for this segmentation task.
Attention mechanisms should not be added by default to every sequence labeling pipeline in argumentation mining.
Pre-generated embeddings remain a competitive choice when contextualized embeddings show no clear advantage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The findings could motivate similar controlled comparisons on other subtasks in argumentation mining to determine where attention is genuinely useful.
Task-specific factors in argumentative text may limit the benefits that attention provides in other NLP domains.
Efficiency considerations might favor baseline models when performance differences are negligible.

Load-bearing premise

The bidirectional long short-term memory network is the current state-of-the-art approach to the unit segmentation task and serves as a fair baseline for the comparison.

What would settle it

A new experiment in which an attention-augmented bidirectional LSTM achieves measurably higher F1 scores than the plain bidirectional LSTM on the same argument unit segmentation datasets would falsify the central finding.
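The comparison described above comes down to computing F1 for two sets of predictions against the same gold labels. The sketch below assumes token-level, macro-averaged F1 over B/I/O labels; the averaging scheme, the function names, and the toy label sequences are assumptions for illustration, since this page does not state the paper's exact metric.

```python
from typing import Dict, List


def f1_per_label(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """Token-level F1 for every label that occurs in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    scores: Dict[str, float] = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores[label] = 2 * precision * recall / (precision + recall)
        else:
            scores[label] = 0.0
    return scores


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of the per-label F1 scores."""
    scores = f1_per_label(gold, pred)
    return sum(scores.values()) / len(scores)


# Toy sequences, not data from the paper.
gold = ["B", "I", "I", "O", "B", "I"]
pred_model_a = ["B", "I", "I", "O", "B", "I"]
pred_model_b = ["B", "I", "O", "O", "B", "I"]

score_a = macro_f1(gold, pred_model_a)
score_b = macro_f1(gold, pred_model_b)
```

A single pair of scores like this only shows a difference on one run; whether the difference is robust is a separate question that needs repeated runs.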

Figures

Figures reproduced from arXiv: 1906.10068 by Hendrik Heuer, Jonas Klaff, Maximilian Spliethöver.

**Figure 1.** (a) The original baseline architecture as reported by Ajjour et al. (2017). (b) The modified baseline architecture without the second input Bi-LSTM. The bold arrows show the positions at which the additional attention layers are added to build the baseline+input and baseline+error architectures. (c) The bilstm architecture incorporates only one Bi-LSTM. The bold arrow shows the position at which the addit…

**Figure 2.** The loss curves of the baseline architecture using …

Original abstract

Attention mechanisms have seen some success for natural language processing downstream tasks in recent years and generated new State-of-the-Art results. A thorough evaluation of the attention mechanism for the task of Argumentation Mining is missing, though. With this paper, we report a comparative evaluation of attention layers in combination with a bidirectional long short-term memory network, which is the current state-of-the-art approach to the unit segmentation task. We also compare sentence-level contextualized word embeddings to pre-generated ones. Our findings suggest that for this task the additional attention layer does not improve upon a less complex approach. In most cases, the contextualized embeddings do also not show an improvement on the baseline score.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper finds attention adds no value over BiLSTM on argument unit segmentation and contextual embeddings mostly do not help either, but the SOTA baseline claim is asserted rather than shown.

This paper runs a head-to-head test of attention layers on top of BiLSTM for argument unit segmentation and also checks contextualized embeddings against pre-generated ones. The headline result is that the extra attention does not improve scores and the contextual embeddings usually do not either. That is the one concrete thing a reader should take away: a negative finding on a specific task and architecture family.

The work is a straightforward empirical comparison using known pieces, so it supplies a data point for people already working on argument mining who might otherwise default to adding attention. No new method or derivation is claimed, which keeps the scope modest but also keeps the claims grounded in what was actually run. The experiments are original in the sense that this exact combination had not been reported before.

The soft spot is the baseline. The abstract states that the BiLSTM is the current state-of-the-art for the task and uses it as the reference point, yet supplies no citation details, re-implementation notes, or confirmation that no stronger non-attention model existed. If that baseline was not tuned to the same level as the attention variants or does not match the prior best result, the no-improvement conclusion weakens. The abstract also omits dataset names, hyperparameter search, statistical tests, and error bars, so the numbers cannot be checked from the summary alone.

The paper is narrow (one task, one architecture family) and carries no formal proofs or reusable code. It is mainly useful to specialists in argument mining who need evidence against automatic adoption of attention. Most other NLP readers will not find enough here to change their practice. I would not bring it to a general reading group and would not cite it. It does not look strong enough for peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript reports a comparative evaluation of attention mechanisms combined with bidirectional LSTM networks for the task of argument unit segmentation in argumentation mining. It also assesses sentence-level contextualized word embeddings against pre-generated embeddings. The central claim is that adding an attention layer does not improve performance over the BiLSTM baseline, and that contextualized embeddings generally do not improve upon the baseline scores.

Significance. If the empirical results hold under rigorous verification, the finding would suggest that for argument unit segmentation, simpler BiLSTM models without additional attention are sufficient, which could influence model selection in argumentation mining research and encourage focus on other factors like data quality or alternative architectures. The work contributes an empirical benchmark comparison in a specialized NLP task.

major comments (2)

[Introduction] Introduction: The assertion that the bidirectional LSTM 'is the current state-of-the-art approach to the unit segmentation task' is load-bearing for the headline claim yet lacks a specific citation to the exact prior result establishing SOTA status; the paper must also document whether the re-implementation matches the original architecture, hyperparameters, and training regime exactly, otherwise the conclusion that attention adds no value does not follow.
[Experimental setup] Experimental setup / results sections: No information is supplied on hyperparameter search procedure, number of random seeds or runs, statistical significance testing, or error bars; without these the reported 'no improvement' differences cannot be verified as robust rather than artifacts of a single run.

minor comments (2)

[Abstract] Abstract: The datasets, exact metrics, and number of attention variants tested should be named explicitly so the scope of the 'in most cases' qualifier is clear.
[Results] Notation and figures: Ensure consistent labeling of the BiLSTM-only baseline versus attention-augmented variants across tables and text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the paper to strengthen the presentation of our results.

Referee: [Introduction] Introduction: The assertion that the bidirectional LSTM 'is the current state-of-the-art approach to the unit segmentation task' is load-bearing for the headline claim yet lacks a specific citation to the exact prior result establishing SOTA status; the paper must also document whether the re-implementation matches the original architecture, hyperparameters, and training regime exactly, otherwise the conclusion that attention adds no value does not follow.

Authors: We will add an explicit citation to the prior work establishing the BiLSTM as SOTA for argument unit segmentation. We will also expand the methods section with a detailed side-by-side comparison of our re-implementation against the original architecture, hyperparameters, and training procedure to confirm fidelity. revision: yes
Referee: [Experimental setup] Experimental setup / results sections: No information is supplied on hyperparameter search procedure, number of random seeds or runs, statistical significance testing, or error bars; without these the reported 'no improvement' differences cannot be verified as robust rather than artifacts of a single run.

Authors: We agree these details are required for verification. The revised manuscript will describe the hyperparameter search, report the number of random seeds/runs performed, include statistical significance tests, and add error bars to all reported scores. revision: yes
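The verification steps named in this exchange (multiple seeds, error bars, a significance test) can be sketched in a few lines. The example below assumes one F1 score per random seed for each of two models, paired by seed, and uses a sign-flip permutation test on the paired differences. The function names and all numbers are placeholders for illustration; none of them come from the paper.

```python
import random
import statistics
from typing import List, Tuple


def mean_and_std(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation, suitable for reporting error bars."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_permutation_test(
    a: List[float], b: List[float], n_resamples: int = 10000, seed: int = 0
) -> float:
    """Two-sided p-value for 'mean(a - b) is zero' via random sign flips."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        # Small tolerance so floating-point ties count as "at least as extreme".
        if abs(sum(flipped) / len(flipped)) >= observed - 1e-12:
            hits += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Placeholder scores for five seeds per model; NOT results from the paper.
baseline_runs = [0.81, 0.80, 0.82, 0.79, 0.81]
attention_runs = [0.80, 0.81, 0.81, 0.79, 0.80]

baseline_mean, baseline_std = mean_and_std(baseline_runs)
attention_mean, attention_std = mean_and_std(attention_runs)
p_value = paired_permutation_test(attention_runs, baseline_runs)
```

With only a handful of seeds the test has little power, so a large p-value is weak evidence of "no difference"; reporting the means with their standard deviations alongside the test makes that limitation visible.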

Circularity Check

0 steps flagged

No circularity: pure empirical model comparison

The paper performs an empirical comparison of attention-augmented BiLSTM models against a BiLSTM baseline and contextualized vs. static embeddings on argument unit segmentation. It contains no equations, derivations, fitted parameters presented as predictions, or self-citation chains that reduce the central claim to its own inputs. The statement that BiLSTM is the current SOTA is an external claim about prior literature rather than a self-referential definition or fitted result within this work. No load-bearing step matches any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the experimental comparison being fair and on the stated premise that BiLSTM is the current SOTA baseline; no theoretical constructs or new entities are introduced.

axioms (1)

domain assumption: BiLSTM is the current state-of-the-art approach to the unit segmentation task
Explicitly invoked in the abstract to justify the baseline choice.

pith-pipeline@v0.9.0 · 5646 in / 1100 out tokens · 33320 ms · 2026-05-25T17:19:22.953350+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "bidirectional long short-term memory network, which is the current state-of-the-art approach to the unit segmentation task"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "attention layer does not improve upon a less complex approach"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

[1] Yamen Ajjour, Wei-Fan Chen, Johannes Kiesel, Henning Wachsmuth, and Benno Stein. 2017. https://doi.org/10.18653/v1/W17-5115 Unit Segmentation of Argumentative Texts. In Proceedings of the 4th Workshop on Argument Mining, pages 118–128, Copenhagen, Denmark. Association for Computational Linguistics.

[2] Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. http://aclweb.org/anthology/C18-1139 Contextual String Embeddings for Sequence Labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. http://arxiv.org/abs/1409.0473 Neural Machine Translation by Jointly Learning to Align and Translate. arXiv: 1409.0473.

[4] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. https://doi.org/10.1162/tacl_a_00051 Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5:135–146.

[5] Elena Cabrio and Serena Villata. 2018. https://doi.org/10.24963/ijcai.2018/766 Five Years of Argument Mining: a Data-driven Analysis. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 5427–5433, Stockholm, Sweden. International Joint Conferences on Artificial Intelligence Organization.

[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. http://arxiv.org/abs/1810.04805 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv: 1810.04805.

[7] Steffen Eger, Johannes Daxenberger, and Iryna Gurevych. 2017. https://doi.org/10.18653/v1/P17-1002 Neural End-to-End Learning for Computational Argumentation Mining. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11–22, Vancouver, Canada. Association for Computational Linguistics.

[8] Google AI Research. 2018. https://github.com/google-research/bert TensorFlow code and pre-trained models for BERT. Last accessed: 2019-05-01, 21:40 UTC+2.

[9] Hendrik Heuer. 2015. https://aaltodoc.aalto.fi:443/handle/123456789/17732 Semantic and stylistic text analysis and text summary evaluation. Master thesis.

[10] Zhao HG. 2018a. https://github.com/CyberZHG/keras-self-attention Attention mechanism for processing sequential data that considers the context for each timestamp. Last accessed: 2019-05-01, 21:39 UTC+2.

[11] Zhao HG. 2018b. https://github.com/CyberZHG/keras-multi-head A wrapper layer for stacking layers horizontally. Last accessed: 2019-05-01, 21:40 UTC+2.

[12] Sepp Hochreiter and Jürgen Schmidhuber. 1997. https://doi.org/10.1162/neco.1997.9.8.1735 Long Short-Term Memory. Neural Computation, 9(8):1735–1780.

[13] Jeremy Howard and Sebastian Ruder. 2018. https://aclweb.org/anthology/papers/P/P18/P18-1031/ Universal Language Model Fine-tuning for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.

[14] Laurent Itti, Christof Koch, and Ernst Niebur. 1998. https://doi.org/10.1109/34.730558 A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259.

[15] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. 2016. http://arxiv.org/abs/1609.04836 On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. arXiv: 1609.04836.

[16] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. http://arxiv.org/abs/1301.3781 Efficient Estimation of Word Representations in Vector Space. arXiv: 1301.3781.

[17] Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. http://arxiv.org/abs/1406.6247 Recurrent Models of Visual Attention. arXiv: 1406.6247.

[18] Gaku Morio and Katsuhide Fujita. 2018. https://aclweb.org/anthology/papers/W/W18/W18-5202/ End-to-End Argument Mining for Discussion Threads Based on Parallel Constrained Pointer Architecture. In Proceedings of the 5th Workshop on Argument Mining, pages 11–21, Brussels, Belgium. Association for Computational Linguistics.

[19] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. https://doi.org/10.3115/v1/D14-1162 Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

[20] Mike Schuster and Kuldip K. Paliwal. 1997. https://doi.org/10.1109/78.650093 Bidirectional recurrent neural networks. IEEE Trans. Signal Processing, 45:2673–2681.

[21] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. 2012. http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf Practical Bayesian Optimization of Machine Learning Algorithms. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 29...

[22] Christian Stab and Iryna Gurevych. 2017. https://doi.org/10.1162/COLI_a_00295 Parsing Argumentation Structures in Persuasive Essays. Computational Linguistics, 43(3):619–659.

[23] Christian Stab, Tristan Miller, Benjamin Schiller, Pranav Rai, and Iryna Gurevych. 2018. http://aclweb.org/anthology/D18-1402 Cross-topic Argument Mining from Heterogeneous Sources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3664–3674. Association for Computational Linguistics. Event-place: Brussels...

[24] Christian Matthias Edwin Stab. 2017. http://tuprints.ulb.tu-darmstadt.de/6006/ Argumentative Writing Support by means of Natural Language Processing. Dissertation, Technische Universität Darmstadt, Darmstadt.

[25] Tobias Mayer, Elena Cabrio, Marco Lippi, Paolo Torroni, and Serena Villata. 2018. https://doi.org/10.3233/978-1-61499-906-5-137 Argument Mining on Clinical Trials. Frontiers in Artificial Intelligence and Applications, pages 137–148.

[26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. http://dl.acm.org/citation.cfm?id=3295222.3295349 Attention is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS '17, pages 6000–6010, USA. Curran Associates I...

[27] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. http://proceedings.mlr.press/v37/xuc15.html Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine ...

[28] Zalando Research. 2018. https://github.com/zalandoresearch/flair A very simple framework for state-of-the-art Natural Language Processing (NLP). Last accessed: 2019-05-01, 21:39 UTC+2.