Dependency Tree Parser for German Clinical Text

Here you will find our dependency tree parser for German clinical text and our fictitious clinical dataset, both of which were presented in our KONVENS paper "A Domain-adapted Dependency Parser for German Clinical Text" [1].

Fictitious Clinical Data: The fictitious clinical dataset used in our second experiment (see ch. 4.3 Experiment 2: Extended Experiments) can be downloaded here. It contains discharge summaries from various clinical domains, written using the template tool Arztbriefmanager, as well as manually written clinical notes from the nephrology domain. The documents are tokenized, split into sentences, and annotated with Part-of-Speech (PoS) tags and dependencies. They were pre-labelled using JCoRe [2] in combination with our model and then manually corrected. To visualize the annotations, we recommend using the brat annotation tool [3].

Clinical Dependency Parser: Our dependency tree parsing model can be downloaded here. Simply replace the default model of Stanford CoreNLP [4] with ours. The following is the parse command, using an example file input.txt:

java -Xmx2g -cp stanford-corenlp-3.8.0.jar edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,depparse -outputFormat conllu -outputExtension .conllu -file input.txt -outputDirectory output/ -depparse.model UD_German_Clinical_retrain_250_0.gz -pos.model german-ud.tagger

The example above uses the standard components for tokenization, sentence splitting and PoS tagging, which can be downloaded from the Stanford website. The model UD_German_Clinical_retrain_250_0.gz is based on the default German dependency parsing model UD_German.gz and was retrained on our clinical data with the neural network dependency parser [5].
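Since the command above writes its result in CoNLL-U format, the parsed sentences can be inspected with a few lines of Python. The following is a minimal sketch, not part of our release: the helper names read_conllu and dependency_edges are hypothetical, and the sample sentence is invented for illustration and not taken from the dataset. It assumes the standard ten tab-separated CoNLL-U columns.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# The ten tab-separated columns defined by the CoNLL-U format.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts (hypothetical helper)."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        sentence.append(dict(zip(CONLLU_FIELDS, columns)))
    if sentence:
        yield sentence


def dependency_edges(sentence: List[Dict[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (head form, relation, dependent form) triples for one sentence."""
    forms = {token["id"]: token["form"] for token in sentence}
    forms["0"] = "ROOT"  # head index 0 marks the root of the sentence
    return [(forms.get(token["head"], "?"), token["deprel"], token["form"])
            for token in sentence]


# Invented sample sentence, for illustration only.
SAMPLE_ROWS = [
    ("1", "Der", "_", "DET", "_", "_", "2", "det", "_", "_"),
    ("2", "Patient", "_", "NOUN", "_", "_", "3", "nsubj", "_", "_"),
    ("3", "klagt", "_", "VERB", "_", "_", "0", "root", "_", "_"),
    ("4", "über", "_", "ADP", "_", "_", "5", "case", "_", "_"),
    ("5", "Schmerzen", "_", "NOUN", "_", "_", "3", "obl", "_", "_"),
    ("6", ".", "_", "PUNCT", "_", "_", "3", "punct", "_", "_"),
]
SAMPLE = "\n".join("\t".join(row) for row in SAMPLE_ROWS) + "\n\n"

sentences = list(read_conllu(SAMPLE.splitlines()))
for head, relation, dependent in dependency_edges(sentences[0]):
    print(f"{relation}({head}, {dependent})")
```

To read real parser output, pass an open file instead of the sample, for instance open("output/input.txt.conllu", encoding="utf-8"). That file name is an assumption derived from the -outputDirectory and -outputExtension options above; check the output directory for the actual name.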

We achieved better results using PoS tags from JCoRe rather than those provided by Stanford CoreNLP. To do so, we used a slightly modified version of CoreNLP, which is available here (please unzip). This version expects input that is already tokenized, sentence-split and PoS-tagged; see the example file input_pos.txt for the expected format. The tool can then be run as follows:

java -jar POSDependencyParser.jar --input input_pos.txt --model /UD_German_Clinical_retrain_250_0.gz

The output is stored in .conll format. For the example above, the result can be found in input_pos.conll.
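To parse several pre-tagged files in one go, the command above can be wrapped in a small script. This is only a sketch under stated assumptions: the function names, the *_pos.txt file pattern and the directory layout are hypothetical, and the output location is inferred from the input_pos.txt / input_pos.conll example above rather than from documented behaviour. Only the --input and --model options shown above are used.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(input_file: Path, model: Path,
                  jar: Path = Path("POSDependencyParser.jar")) -> List[str]:
    """Assemble the parser call using only the --input and --model options."""
    return ["java", "-jar", str(jar),
            "--input", str(input_file),
            "--model", str(model)]


def expected_output(input_file: Path) -> Path:
    """Assumption: the result is written next to the input with a .conll suffix."""
    return input_file.with_suffix(".conll")


def parse_all(input_dir: Path, model: Path,
              pattern: str = "*_pos.txt") -> List[Path]:
    """Run the parser on every matching file and return the expected outputs.

    The file pattern is a hypothetical naming convention, not one the tool requires.
    """
    outputs = []
    for input_file in sorted(input_dir.glob(pattern)):
        # check=True raises CalledProcessError if the Java process fails.
        subprocess.run(build_command(input_file, model), check=True)
        outputs.append(expected_output(input_file))
    return outputs
```

A call such as parse_all(Path("data"), Path("UD_German_Clinical_retrain_250_0.gz")) would then process every matching file in a (hypothetical) data directory, provided Java and the unzipped jar are available in the working directory.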

References:

[1] Elif Kara, Tatjana Zeen, Aleksandra Gabryszak, Klemens Budde, Danilo Schmidt and Roland Roller. 2018. A Domain-adapted Dependency Parser for German Clinical Text. In Proceedings of KONVENS 2018, "The Conference on Natural Language Processing". Vienna, Austria.

[2] Johannes Hellrich, Franz Matthies, Erik Faessler and Udo Hahn. 2015. Sharing models and tools for processing German clinical texts. Studies in Health Technology and Informatics, 210:734-738.

[3] Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou and Jun'ichi Tsujii. 2012. brat: a Web-based Tool for NLP-Assisted Text Annotation. In Proceedings of the Demonstrations Session at EACL 2012.

[4] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55-60.

[5] Danqi Chen and Christopher Manning. 2014. A Fast and Accurate Dependency Parser using Neural Networks. In Proceedings of EMNLP 2014, Association for Computational Linguistics, 740-750.