Simple Natural Language Processing Tools for Danish

Leon Derczynski

arxiv: 1906.11608 · v2 · pith:6C76VWR7new · submitted 2019-06-27 · 💻 cs.CL


Pith reviewed 2026-05-25 14:59 UTC · model grok-4.3


keywords Danish · Natural Language Processing · Baseline Tools · Machine Learning · Text Processing · Freely Available Tools


The pith

A set of baseline machine-learning tools for automatic Danish text processing has been created and made freely available.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a collection of simple tools that perform standard natural language processing tasks on Danish text. These tools rely on models trained using machine learning on previously labeled Danish documents. They are hosted at ITU Copenhagen with a commitment to remain freely accessible. A sympathetic reader would care because the tools supply ready starting points for handling Danish without requiring new model development from scratch. This makes automatic analysis of Danish text more practical for everyday use.

Core claim

The paper describes a set of baseline tools for automatic processing of Danish text. The tools are machine-learning based, using natural language processing models trained over previously annotated documents. They are maintained at ITU Copenhagen and will always be freely available.

What carries the argument

Machine-learning models trained on previously annotated Danish documents that power the baseline processing tools.

If this is right

Anyone working with Danish text can apply the tools immediately for basic automatic processing.
The tools establish performance baselines against which future Danish NLP methods can be measured.
Free and maintained access removes the need for individual groups to repeat the training process.
The same training approach can be reused whenever new annotated Danish data becomes available.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar baseline toolkits could be assembled for other languages that currently lack public NLP resources.
Wider availability of these tools might increase the volume of Danish text data that gets automatically annotated over time.
The models could serve as a foundation for building more specialized Danish applications in domains such as search or translation.

Load-bearing premise

Sufficient quantities of high-quality annotated Danish documents already exist to train the models effectively.

What would settle it

Demonstration that the tools achieve no better than random performance on standard Danish tasks such as part-of-speech tagging or named entity recognition, or failure to release the tools publicly as stated.
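The "no better than random" test above can be made concrete with a small sketch. The following Python is illustrative only and is not taken from the paper: the function names, the example sentences, and the tag sequences are all hypothetical, and the tags merely follow the Universal Dependencies naming style. It compares a tagger's token-level accuracy with two trivial reference points, uniform random guessing over the observed tag set and always predicting the most frequent tag.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def chance_accuracy(gold):
    """Expected accuracy of guessing uniformly among the tags seen in gold."""
    return 1 / len(set(gold))


def majority_accuracy(gold):
    """Accuracy of always predicting the most frequent gold tag."""
    return Counter(gold).most_common(1)[0][1] / len(gold)


# Hypothetical gold tags for "Jeg læser en bog ." and "Hun skriver breve ."
gold = ["PRON", "VERB", "DET", "NOUN", "PUNCT",
        "PRON", "VERB", "NOUN", "PUNCT"]
# Hypothetical tagger output with one error (fourth token tagged VERB).
predicted = ["PRON", "VERB", "DET", "VERB", "PUNCT",
             "PRON", "VERB", "NOUN", "PUNCT"]

tagger_acc = accuracy(gold, predicted)
beats_chance = tagger_acc > chance_accuracy(gold)
beats_majority = tagger_acc > majority_accuracy(gold)
```

On real data the gold tags would come from a held-out portion of an annotated Danish corpus, and the majority-tag figure is usually the more demanding of the two reference points.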

Original abstract

This technical note describes a set of baseline tools for automatic processing of Danish text. The tools are machine-learning based, using natural language processing models trained over previously annotated documents. They are maintained at ITU Copenhagen and will always be freely available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a short technical note releasing baseline Danish NLP tools rather than presenting new research or evaluations.


The main point is that this paper announces freely available machine-learning tools for basic Danish text processing such as tokenization and tagging, trained on existing annotated data and hosted at ITU Copenhagen. It commits to keeping them open and maintained, which is the core contribution. That fills a practical gap for a lower-resourced language where ready baselines can help people get started without building everything from scratch.

The straightforward approach of applying standard methods to prior annotations is clear and honest about what it is doing. The free availability pledge is also a real service to anyone working with Danish text.

On the downside, there are no performance numbers, error analysis, or comparisons to other tools or prior work. The description stays at the level of a general announcement without showing how well the models actually perform or what data was used in detail. This makes it hard to judge whether these are strong baselines or just functional ones. The paper does not claim new methods or derivations, so the lack of quantitative support is consistent with its scope but still leaves the quality unverified from the text alone.

This is mainly useful for practitioners or students needing quick Danish components rather than for researchers seeking advances in multilingual NLP techniques. A reading group focused on low-resource languages might skim it for the resource link, but it would not drive much technical discussion. I would not cite it for any methodological result. It deserves peer review in a tools or resources track because the release itself has community value even if the technical depth is modest; a referee could ask for basic metrics without changing the paper's nature.

Referee Report

0 major / 1 minor

Summary. This technical note announces a collection of machine-learning-based baseline NLP tools for Danish text processing. The tools are trained on previously annotated documents and are maintained at ITU Copenhagen with a commitment to free availability.

Significance. If the announced tools are functional and accessible as described, they would offer a practical resource for Danish NLP, a lower-resourced language, by providing ready-to-use baselines for standard tasks and thereby lowering the barrier for subsequent research and applications.

minor comments (1)

The manuscript is extremely brief; expanding the description to explicitly enumerate the covered tasks (e.g., tokenization, tagging) and the underlying annotation corpora would improve utility for potential users without altering the announcement character of the note.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of the manuscript and the recommendation to accept. The report contains no major comments.

Circularity Check

0 steps flagged

No significant circularity


The paper is a short technical note announcing the existence of ML-based NLP baseline tools for Danish, trained on existing annotated corpora. It contains no equations, derivations, predictions, fitted parameters, or load-bearing claims that reduce to inputs by construction. No self-citations or uniqueness theorems are invoked. The description is self-contained and factual, with the reader's weakest assumption (data availability) external to any internal reasoning chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical content, free parameters, or new entities are introduced; the paper is a technical note describing tools built from prior annotated data.

pith-pipeline@v0.9.0 · 5538 in / 999 out tokens · 40130 ms · 2026-05-25T14:59:52.176462+00:00 · methodology


Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 2 internal anchors

[1]

The Stanford CoreNLP natural language processing toolkit

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd annual meeting of the Association for Computational Linguistics, pages 55–60. Association for Computational Linguistics, 2014

work page 2014
[2]

Incorporating non-local information into information extraction systems by Gibbs sampling

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd annual meeting on association for computational linguistics, pages 363–370. Association for Computational Linguistics, 2005

work page 2005
[3]

DKIE: Open source information extraction for Danish

Leon Derczynski and Kenneth S Bøgh. DKIE: Open source information extraction for Danish. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 61–64. Association for Computational Linguistics, 2014

work page 2014
[4]

Developing language processing components with GATE version 8.0

Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Niraj Aswani, Ian Roberts, Genevieve Gorrell, Adam Funk, Angus Roberts, Danica Damljanovic, Leon Derczynski, et al. Developing language processing components with GATE version 8.0. University of Sheffield Department of Computer Science, 2012

work page
[5]

Universal dependencies for Danish

Anders Johannsen, Héctor Martínez Alonso, and Barbara Plank. Universal dependencies for Danish. In Proc. International Workshop on Treebanks and Linguistic Theories, page 157, 2015

work page 2015
[6]

Danish dependency treebank

Matthias T Kromann, Line Mikkelsen, and Stine Kern Lynge. Danish dependency treebank. In Proc. International Workshop on Treebanks and Linguistic Theories, pages 217–220, 2003

work page 2003
[7]

Generalised Brown clustering and roll-up feature generation

Leon Derczynski and Sean Chester. Generalised Brown clustering and roll-up feature generation. In Proc. AAAI Conference on Artificial Intelligence, 2016

work page 2016
[8]

Tune your Brown clustering, please

Leon Derczynski, Sean Chester, and Kenneth S Bøgh. Tune your Brown clustering, please. In Proc. International Conference Recent Advances in Natural Language Processing, RANLP, volume 2015, pages 110–117. Association for Computational Linguistics, 2015

work page 2015
[9]

Bag of Tricks for Efficient Text Classification

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[10]

Distant supervision from disparate sources for low-resource part-of-speech tagging

Barbara Plank and Željko Agić. Distant supervision from disparate sources for low-resource part-of-speech tagging. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 614–620. Association for Computational Linguistics, 2018

work page 2018
[11]

Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss

Barbara Plank, Anders Søgaard, and Yoav Goldberg. Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss. In ACL 2016, arXiv preprint arXiv:1604.05529, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[12]

Analysis of named entity recognition and linking for tweets

Leon Derczynski, Diana Maynard, Giuseppe Rizzo, Marieke Van Erp, Genevieve Gorrell, Raphaël Troncy, Johann Petrak, and Kalina Bontcheva. Analysis of named entity recognition and linking for tweets. Information Processing & Management, 51(2):32–49, 2015

work page 2015
[13]

UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing

Milan Straka, Jan Hajic, and Jana Straková. UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing. In Proc. LREC, 2016.

work page 2016
[14]

Universal dependencies v1: A multilingual treebank collection

Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. Universal dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 165...

work page 2016
