Technical-Laymen Corpus (TLC)

The Technical-Laymen Corpus (TLC) is an annotated dataset based on texts from Med1.de, a German patient forum covering a wide variety of health-related topics. Its users are non-professionals seeking exchange, opinions, and advice. Med1 is freely accessible: discussions can be read without registration, which is required only to participate. The Med1 operating team does not provide medical consultation, but it guides the community in matters of netiquette. The users are anonymous; only their user names are known to us.

We used two subforums: kidney diseases, and stomach and intestines. Each subforum contains a variety of user questions ("threads"), each with a varying number of answers ("posts"). We used a web crawler (Scrapy [1]) to collect every post of both subforums, including the time of posting, the author's nickname, and the thread title. As the data does not contain any personal information, we have the permission of Med1 to share the corpus with the scientific community.
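As a rough illustration of the crawled fields listed above, the following minimal Python sketch models one post and groups posts into their threads. The class name `Post`, its field names, the helper `group_by_thread`, and the sample values are all hypothetical; they are not the format of the distributed corpus and not part of the crawler we used.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Post:
    """One crawled post (hypothetical field names)."""
    subforum: str
    thread_title: str
    author: str          # forum nickname only
    posted_at: datetime  # time of posting
    text: str


def group_by_thread(posts: Iterable[Post]) -> Dict[Tuple[str, str], List[Post]]:
    """Group posts by (subforum, thread title), oldest post first."""
    threads: Dict[Tuple[str, str], List[Post]] = defaultdict(list)
    for post in posts:
        threads[(post.subforum, post.thread_title)].append(post)
    for thread_posts in threads.values():
        thread_posts.sort(key=lambda p: p.posted_at)
    return dict(threads)


# Invented placeholder posts, for illustration only.
SAMPLE_POSTS = [
    Post("kidney", "Thread A", "user2", datetime(2018, 3, 2, 9, 30), "answer"),
    Post("kidney", "Thread A", "user1", datetime(2018, 3, 1, 18, 0), "question"),
    Post("stomach", "Thread B", "user3", datetime(2018, 5, 7, 12, 0), "question"),
]

for key, thread_posts in group_by_thread(SAMPLE_POSTS).items():
    print(key, [p.author for p in thread_posts])
```

Keying threads by subforum and title is a simplification; a real crawl would more likely key on a thread URL or identifier, since titles need not be unique.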

                           Kidney Forum    Stomach and Intestines Forum
Date of crawling           2018-11-05      2019-01-10
Number of crawled posts    9,516           219,404
Number of corpus entries   2000            2000

The annotation covers two concepts: (1) lay expressions and (2) technical terms. We focus mainly on symptoms, diseases, treatments, and examinations. However, annotators were free to label information beyond this focus (e.g. body parts, medication). In addition to the concept label, the counterpart synonym or explanation is given as free text. The annotation was carried out by two medical students over several iterations using the brat annotation tool [2].
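The annotation scheme described above can be read programmatically if the annotations are stored in brat's standoff format. The following minimal Python sketch makes several assumptions: that the concept labels are named `Lay` and `Technical`, and that the free-text counterpart is stored as a brat annotator note attached to the concept. The released files may be organized differently, and the sample annotations are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Concept:
    ann_id: str                   # brat identifier, e.g. "T1"
    label: str                    # concept label (assumed: "Lay" or "Technical")
    spans: List[Tuple[int, int]]  # character offsets into the post text
    text: str                     # annotated surface form
    counterpart: str = ""         # synonym or explanation, if present


def parse_brat_ann(ann_text: str) -> List[Concept]:
    """Parse text-bound annotations and their annotator notes from .ann content."""
    concepts: Dict[str, Concept] = {}
    notes: Dict[str, str] = {}
    for line in ann_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        if line.startswith("T"):
            # Format: T1<TAB>Label start end[;start end...]<TAB>covered text
            label, _, offsets = fields[1].partition(" ")
            spans = []
            for fragment in offsets.split(";"):
                start, end = fragment.split()
                spans.append((int(start), int(end)))
            concepts[fields[0]] = Concept(fields[0], label, spans, fields[2])
        elif line.startswith("#"):
            # Format: #1<TAB>AnnotatorNotes T1<TAB>free text
            note_type, _, target = fields[1].partition(" ")
            if note_type == "AnnotatorNotes":
                notes[target] = fields[2]
    # Attach notes after the loop so that line order does not matter.
    for target, note in notes.items():
        if target in concepts:
            concepts[target].counterpart = note
    return list(concepts.values())


# Invented sample annotations (labels and offsets are assumptions).
SAMPLE_ANN = (
    "T1\tLay 12 23\tHexenschuss\n"
    "#1\tAnnotatorNotes T1\tLumbago\n"
    "T2\tTechnical 40 55\tNephrolithiasis\n"
    "#2\tAnnotatorNotes T2\tNierensteine\n"
)

for concept in parse_brat_ann(SAMPLE_ANN):
    print(concept.label, concept.text, "->", concept.counterpart)
```

Each lay expression is paired with its technical counterpart and vice versa, so the parsed list can serve directly as a small bidirectional lookup between the two registers.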


Figure 1: Text with annotated concepts.


Figure 2: Annotation menu.

The dataset is available for download. Further information can be found in our LREC paper [3].

References:

[1] Dimitrios Kouzis-Loukas. 2016. Learning Scrapy. Packt Publishing Ltd.

[2] Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou and Jun'ichi Tsujii. 2012. brat: a Web-based Tool for NLP-Assisted Text Annotation. In Proceedings of the Demonstrations Session at EACL 2012.

[3] Laura Seiffe, Oliver Marten, Michael Mikhailov, Sven Schmeier, Sebastian Möller and Roland Roller. 2020. From Witch's Shot to Music Making Bones - Resources for Medical Laymen to Technical Language and Vice Versa. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2020), Marseille, France.