Setup & Documentation:
Cross-lingual Candidate Search for Biomedical Concept Normalization
The following documentation is based on the work described in:
Cross-lingual Candidate Search for Biomedical Concept Normalization. Roland Roller, Madeleine Kittner, Dirk Weissenborn, Ulf Leser. In Proceedings of MultilingualBIO, Miyazaki, Japan, 2018.
The work describes a character-level biomedical translation model designed to improve candidate search for concept normalization. We trained the model on the multilingual (parallel) data of UMLS together with the FreeDict dictionary. For more details, see our paper. We would appreciate it if you cited our work.
Note: Basic knowledge of Solr and TensorFlow is required to use our models.
Setting up the Neural Translation Model:
To get started, you need Python 3, Conda/Miniconda and pip. The installation was tested on Ubuntu.
First, download our sources and unpack them. To install the necessary libraries, please execute one of the following commands from the folders /CharTranslator/env/conda/ or /CharTranslator/env/pip3/:
The following code has to be executed with the new environment activated.
Using the pre-trained translation models:
The CharTranslator package comes with the pre-trained models described in Roller et al. (2018), located in CharTranslator/models/, which can be used directly. As described in our paper, we trained the following biomedical translation models:
- Dutch ==> English [du_en]
- French ==> English [fr_en]
- German ==> English [de_en]
- Spanish ==> English [es_en]
The following example shows how to use the model:
    # Loading the model (TensorFlow 1.x API)
    import pickle

    import tensorflow as tf
    from model import *  # provides CharEncoderDecoder

    sess = tf.InteractiveSession()

    # Load the model settings
    with open("PATH TO THE MODEL SETTING FILE config.pickle", "rb") as f:
        conf = pickle.load(f)

    # Build the model on the CPU and restore the trained weights
    model = CharEncoderDecoder.from_config(conf, 124, "/cpu:0")
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver(model.train_variables)
    saver.restore(sess, "PATH TO THE MODEL")

    # Translate a string
    model.decode(sess, "STRING TO TRANSLATE", max_length=20)
You can also directly test the models using the test script:
    # Example for French
    python test_translator.py --str "rein" --lang "fr"
    # Output: five possible translations, which might vary:
    # ['kidney', 'kidney', 'kidney', 'kidney', 'kidney']

    # Example for German
    python test_translator.py --str "kopfschmerzen" --lang "de"
    # Output: ['headache', 'headache', 'headaing', 'headaing', 'headache']
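Since the test script returns several candidate translations, some of them duplicates or near misses (such as 'headaing' above), it can be useful to collapse the list before passing it on to candidate search. The following is a minimal sketch and not part of the CharTranslator package: the helper name `rank_candidates` is hypothetical, and it assumes the translator output is a plain list of strings, as in the examples above.

```python
from collections import Counter


def rank_candidates(candidates):
    """Return the unique candidates, most frequent first.

    Candidates with the same frequency keep the order in which
    they first appeared in the input list.
    """
    counts = Counter(candidates)
    # most_common() sorts by count with a stable sort, so ties keep
    # their first-seen order (dict insertion order, Python 3.7+).
    return [candidate for candidate, _ in counts.most_common()]


if __name__ == "__main__":
    output = ["headache", "headache", "headaing", "headaing", "headache"]
    print(rank_candidates(output))  # ['headache', 'headaing']
```

Ranking by frequency is only one possible heuristic; keeping all distinct candidates may give better recall in the subsequent search.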
Training new models:
If you like, you can also train your own model. In our work, we used UMLS and FreeDict for training; however, you can use any parallel data you like. First, select the source language and the target language. The input data must then be constructed as follows:
General Format:
[Word in the source language] \t [Tuple of words in the target language that have equivalent meaning, separated by tabs]
A simple example (German ==> English) can be found in CharTranslator/input_example/
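To make the format concrete, here is a minimal sketch of writing and reading such a file. It is not taken from the CharTranslator sources: the function names `write_pairs` and `read_pairs` are hypothetical, and the word pairs are illustrative rather than copied from CharTranslator/input_example/.

```python
import io


def write_pairs(pairs, handle):
    """Write (source_word, target_words) pairs, one tab-separated line each."""
    for source, targets in pairs:
        handle.write("\t".join([source, *targets]) + "\n")


def read_pairs(handle):
    """Parse tab-separated lines back into (source_word, target_words) pairs."""
    pairs = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank lines and entries without any translation
        pairs.append((fields[0], tuple(fields[1:])))
    return pairs


if __name__ == "__main__":
    # Illustrative German ==> English entries
    examples = [
        ("Niere", ("kidney",)),
        ("Kopfschmerzen", ("headache", "cephalalgia")),
    ]
    buffer = io.StringIO()
    write_pairs(examples, buffer)
    buffer.seek(0)
    print(read_pairs(buffer))
```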
In case you intend to follow a similar setup as described in our paper, please download the sources from FreeDict. Moreover, you need to download and convert the data from UMLS MRCONSO.
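One possible way to derive parallel terms from UMLS is sketched below. This is not the conversion used for the paper, only an illustration under stated assumptions: it relies on the standard pipe-delimited MRCONSO.RRF layout (concept identifier in column 0, language in column 1, term string in column 14), pairs terms that share a concept identifier, and lowercases them, which is an assumption of this sketch. The function name `parallel_terms` is hypothetical.

```python
from collections import defaultdict

CUI, LAT, STR = 0, 1, 14  # column positions in MRCONSO.RRF


def parallel_terms(lines, source_lang, target_lang="ENG"):
    """Pair each source-language term with the target-language terms of the same CUI."""
    by_cui = defaultdict(lambda: defaultdict(list))
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) <= STR:
            continue  # malformed row
        lang = fields[LAT]
        if lang in (source_lang, target_lang):
            terms = by_cui[fields[CUI]][lang]
            term = fields[STR].lower()  # lowercasing is an assumption of this sketch
            if term not in terms:
                terms.append(term)
    pairs = []
    for langs in by_cui.values():
        targets = langs.get(target_lang, [])
        if not targets:
            continue  # no translation available for this concept
        for source in langs.get(source_lang, []):
            pairs.append((source, tuple(targets)))
    return pairs


if __name__ == "__main__":
    def row(cui, lang, term):
        # Build an illustrative 18-column row; the CUIs used here are placeholders.
        return "|".join([cui, lang] + [""] * 12 + [term, "", "", ""]) + "|"

    demo = [
        row("C0000000", "GER", "Kopfschmerzen"),
        row("C0000000", "ENG", "Headache"),
        row("C0000001", "GER", "Niere"),
    ]
    print(parallel_terms(demo, "GER"))
```

Each resulting pair corresponds to one line of the training format described above, with the target terms joined by tabs.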
Candidate Search & Normalization:
In the backend of our candidate search, we use Solr 6.5. A good configuration of Solr and its correct application will strongly influence the results. You can find the schema we used for our experiments in the data provided. In order to set up the Solr system, the following steps have to be conducted:
- Create a new core (named UMLSMultiLing in our setup).
- Replace the schema of the new core in solr-6.5/server/solr/UMLSMultiLing/conf/ with our schema (CharTranslator/solr/schema).
- Restart Solr.
Now you need to move the data into your core. If you have a local UMLS installation, you can directly use the following Python script to copy the UMLS data of your target language into Solr. For our experiments, we applied the command for ENG, SPA, FRE, GER and DUT. For English in particular, this process will take some time.
The same script can be used to insert the Quaero training data into Solr (in case you want to reproduce our experiments).
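If you want to index your own data without the provided script, documents can also be sent to Solr's JSON update handler. The sketch below only builds the request and does not send it. It is not the script from this package: the function name `build_update_request` is hypothetical, the base URL assumes a default local Solr installation, and the field names (`id`, `cui`, `term`, `lang`) are placeholders that must be replaced with the fields defined in CharTranslator/solr/schema.

```python
import json
import urllib.request


def build_update_request(docs, core="UMLSMultiLing", base_url="http://localhost:8983/solr"):
    """Build (but do not send) a POST request for Solr's JSON update handler."""
    url = f"{base_url}/{core}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical field names and placeholder values; adapt them to the actual schema.
    docs = [{"id": "1", "cui": "C0000000", "term": "Krebs", "lang": "GER"}]
    request = build_update_request(docs)
    print(request.get_method(), request.full_url)
    # To actually index the documents, pass the request to urllib.request.urlopen(request).
```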
Now restart Solr again.
The following commands are used to access the data:
    from function_set import *

    SOLR_CORE = "UMLSMultiLing"
    target_lang = "GER"
    mention = "Krebs"

    resultN_TL, resultSolr_TL = fuzzySearchSolr(mention, SOLR_CORE, target_lang, 10)
    print(">", resultN_TL, resultSolr_TL)
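Translation and search can be combined into a simple cross-lingual lookup: search with the original mention, then with each of its translations. The following is a rough sketch of that idea, not the pipeline evaluated in the paper. The function name `cross_lingual_search` is hypothetical, and both `translate` and `search` are passed in as callables so that the example runs without TensorFlow or Solr; the stubs and their return values are purely illustrative.

```python
def cross_lingual_search(mention, translate, search,
                         source_lang="GER", target_lang="ENG", rows=10):
    """Search with the original mention and with each distinct translation of it.

    translate: callable taking a string and returning a list of candidate strings
    search:    callable taking (text, language, rows) and returning a result list
    """
    results = {(source_lang, mention): search(mention, source_lang, rows)}
    # dict.fromkeys drops duplicate translations while keeping their order
    for translation in dict.fromkeys(translate(mention)):
        results[(target_lang, translation)] = search(translation, target_lang, rows)
    return results


if __name__ == "__main__":
    # Stand-ins for the translation model and the Solr lookup
    def fake_translate(text):
        return ["cancer", "cancer", "crab"]

    def fake_search(text, lang, rows):
        return [f"{lang}:{text}"][:rows]

    for query, hits in cross_lingual_search("Krebs", fake_translate, fake_search).items():
        print(query, hits)
```

With the package installed, `translate` could wrap `model.decode(sess, text, max_length=20)` and `search` could wrap `fuzzySearchSolr(text, SOLR_CORE, lang, rows)`. Whether `decode` returns a list of strings is an assumption here and should be checked against the sources.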