Dependency Tree Parser for German Clinical Text

Here you will find our dependency tree parser for German clinical text, as well as our fictitious clinical dataset, both of which were presented in our KONVENS paper "A Domain-adapted Dependency Parser for German Clinical Text" [1].

Fictitious Clinical Data: The fictitious clinical dataset used in our second experiment (see ch. 4.3 Experiment 2: Extended Experiments) can be downloaded here. It contains discharge summaries from various clinical domains, written using the template tool Arztbriefmanager, as well as manually written clinical notes from the nephrology domain. The documents are tokenized and split into sentences, and contain Part-of-Speech (PoS) and dependency annotations. The documents were pre-labelled using JCoRe [2] in combination with our model before being manually corrected. To visualize the annotations, we recommend using the brat annotation tool [3].

Clinical Dependency Parser: Our dependency tree parsing model can be downloaded here. Simply replace the default model of Stanford CoreNLP [4] with ours. The following is the parse command, using an example file input.txt:

java -Xmx2g -cp stanford-corenlp-3.8.0.jar edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,depparse -outputFormat conllu -outputExtension .conllu -file input.txt -outputDirectory output/ -depparse.model UD_German_Clinical_retrain_250_0.gz -pos.model german-ud.tagger

The example above uses the standard models for tokenization, sentence splitting and PoS tagging, which can be downloaded from the Stanford website. The model UD_German_Clinical_retrain_250_0.gz is based on the default German dependency parsing model UD_German.gz and was re-trained with the neural network dependency tree parser [5] using our clinical data.

We achieved better results using PoS tags from JCoRe rather than those provided by Stanford CoreNLP. To do so, we used a slightly modified version of CoreNLP, which is available here (please unzip). In this version, the input data has to be tokenized, sentence-split and PoS-tagged. See the example input file input_pos.txt. Overall, you can use the tool in the following way:

java -jar POSDependencyParser.jar --input input_pos.txt --model /UD_German_Clinical_retrain_250_0.gz

The output will be stored in .conll format. For the example above, the result can be found in input_pos.conll.

How to interpret the output format?

Stanford dependencies provide a representation of the grammatical relations between the words in a sentence. Each dependency is a triple: the name of the relation, the governor and the dependent.

Stanford CoreNLP converts a plain text file into a sequence of tokens, which are printed one per line. Tokens are numbered consecutively (column 1) and annotated with further linguistic information (see below). For example, the output for the elliptical sentence "Keine AP, gute Belastbarkeit." is:

1  Keine          PIAT  2  det
2  AP             NN    0  root
3  ,              $,    5  punct
4  gute           ADJA  5  amod
5  Belastbarkeit  NN    2  conj
6  .              $.    2  punct

[Figure: the example sentence with its syntactic annotation. English translation: 'No AP, good resilience.']

The output also gives the Part-of-Speech (PoS) tag of each token (column 3) and the number of the token from which its incoming edge originates (column 4). Together, these allow the dependencies to be read off directly. Each token has exactly one incoming edge.
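The column layout described above can be read into a small data structure. The following is a minimal sketch, not part of the released tools: the names Token and parse_sentence are hypothetical, and it assumes that the five columns are separated by tabs or spaces in the output file.

```python
from dataclasses import dataclass


@dataclass
class Token:
    index: int      # column 1: position of the token in the sentence
    form: str       # column 2: the token itself
    pos: str        # column 3: PoS tag
    head: int       # column 4: number of the governing token, 0 for the root
    relation: str   # column 5: label of the incoming edge


def parse_sentence(lines):
    """Parse five-column lines (assumed whitespace-separated) into Token objects."""
    tokens = []
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank lines and anything that is not a token line
        index, form, pos, head, relation = fields
        tokens.append(Token(int(index), form, pos, int(head), relation))
    return tokens


# The example sentence from this page, written with tab-separated columns.
EXAMPLE = (
    "1\tKeine\tPIAT\t2\tdet\n"
    "2\tAP\tNN\t0\troot\n"
    "3\t,\t$,\t5\tpunct\n"
    "4\tgute\tADJA\t5\tamod\n"
    "5\tBelastbarkeit\tNN\t2\tconj\n"
    "6\t.\t$.\t2\tpunct\n"
)

tokens = parse_sentence(EXAMPLE.splitlines())
for token in tokens:
    print(token.index, token.form, token.pos, token.head, token.relation)
```

If your output file uses a different column layout, only the unpacking line in parse_sentence needs to change.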

In the example above, token 2, "AP" (abbreviation for "Angina Pectoris"), is the root node; it is the head of the whole sentence. The node has three outgoing edges, to tokens 1, 5 and 6, which can be recognized by the shared number 2 in column 4. The sentence contains two further relations: from token 5 to tokens 3 and 4 (evident from the number 5 in column 4).
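Reading the outgoing edges off column 4 amounts to grouping the tokens by their head number. A minimal sketch of that grouping follows; the function name outgoing_edges is hypothetical, and the rows are copied by hand from the example output.

```python
from collections import defaultdict

# (token number, token, head number) from the example output; head 0 marks the root.
ROWS = [
    (1, "Keine", 2),
    (2, "AP", 0),
    (3, ",", 5),
    (4, "gute", 5),
    (5, "Belastbarkeit", 2),
    (6, ".", 2),
]


def outgoing_edges(rows):
    """Map each head number to the numbers of the tokens it governs."""
    children = defaultdict(list)
    for index, _form, head in rows:
        children[head].append(index)
    return dict(children)


edges = outgoing_edges(ROWS)
print(edges[2])  # tokens governed by "AP": [1, 5, 6]
print(edges[5])  # tokens governed by "Belastbarkeit": [3, 4]
print(edges[0])  # the root token: [2]
```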

The label of each edge is listed in the fifth and last column, on the line of the token the edge points to. Thus, the edge from "AP" to "Keine" carries the relation "det" [6] (determiner).
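Combining columns 4 and 5 gives the triples of relation, governor and dependent mentioned at the start of this section. The sketch below is for illustration only: the function name to_triples and the placeholder "ROOT" for head number 0 are assumptions, and the rows are copied by hand from the example output.

```python
# (token number, token, PoS tag, head number, relation) from the example output.
ROWS = [
    (1, "Keine", "PIAT", 2, "det"),
    (2, "AP", "NN", 0, "root"),
    (3, ",", "$,", 5, "punct"),
    (4, "gute", "ADJA", 5, "amod"),
    (5, "Belastbarkeit", "NN", 2, "conj"),
    (6, ".", "$.", 2, "punct"),
]


def to_triples(rows):
    """Turn each row into a (relation, governor, dependent) triple."""
    forms = {index: form for index, form, _pos, _head, _rel in rows}
    forms[0] = "ROOT"  # assumed placeholder for the artificial root
    return [
        (relation, forms[head], forms[index])
        for index, _form, _pos, head, relation in rows
    ]


triples = to_triples(ROWS)
for relation, governor, dependent in triples:
    print(f"{relation}({governor}, {dependent})")
```

The first line printed is det(AP, Keine), which is the determiner relation described above.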

The following easy-to-use tool illustrates this kind of annotation. It performs syntactic annotation at the sentence level: http://biomedical.dfki.de/mEx

1. Click the "mEx" button
2. Enter a sentence to be annotated into the window "Input Text" (e.g. "Keine AP, gute Belastbarkeit.")
3. To obtain a syntactic analysis, click on the green "Syntax" button below

References:

[1] Elif Kara, Tatjana Zeen, Aleksandra Gabryszak, Klemens Budde, Danilo Schmidt and Roland Roller. 2018. A Domain-adapted Dependency Parser for German Clinical Text. In Proceedings of KONVENS 2018, "The Conference on Natural Language Processing". Vienna, Austria.

[2] Johannes Hellrich, Franz Matthies, Erik Faessler and Udo Hahn. 2015. Sharing models and tools for processing German clinical texts. Studies in Health Technology and Informatics, 210:734-738.

[3] Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou and Jun'ichi Tsujii. 2012. brat: a Web-based Tool for NLP-Assisted Text Annotation. In Proceedings of the Demonstrations Session at EACL 2012.

[4] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55-60.

[5] Danqi Chen and Christopher Manning. 2014. A Fast and Accurate Dependency Parser using Neural Networks. In Proceedings of EMNLP 2014, Association for Computational Linguistics, 740-750.

[6] For a complete list of language-specific relations for German, please refer to the UD website: http://universaldependencies.org/de/index.html.