Setup & Documentation:

Cross-lingual Candidate Search for Biomedical Concept Normalization

The following documentation is based on the work described in:

Cross-lingual Candidate Search for Biomedical Concept Normalization. Roland Roller, Madeleine Kittner, Dirk Weissenborn, Ulf Leser, In Proceedings of MultilingualBIO, Miyazaki, Japan, 2018

The work describes a biomedical translation model at the character level, intended to improve candidate search for concept normalization. We trained models on the multilingual (parallel) data of UMLS, complemented by the FreeDict dictionary. For more details, see our paper. We would appreciate it if you cited our work.

Note: Basic knowledge of Solr and TensorFlow is expected and necessary to use our models.

Setting up the Neural Translation Model:

To get started, you need Python 3, Conda/Miniconda, and pip. The installation was tested on Ubuntu.

First, download our sources and unpack them. To install the necessary libraries, please execute the following command from the folder /CharTranslator/env/conda/ (a pip-based alternative is provided in /CharTranslator/env/pip3/):


conda env create -f environment.yml

Afterwards, activate the new environment; depending on your conda version, one of the following commands applies:

source activate cross_lingual_candidate_search
conda activate cross_lingual_candidate_search

Using the pre-trained translation models:

The CharTranslator package ships with the pre-trained models (CharTranslator/models/) described in Roller et al. (2018), which can be used directly. As described in our paper, we trained the following biomedical translation models:

  • Dutch ==> English [du_en]
  • French ==> English [fr_en]
  • German ==> English [de_en]
  • Spanish ==> English [es_en]

The following example shows how to use the model:

# Loading a pre-trained model
import pickle

import tensorflow as tf

from model import *

sess = tf.InteractiveSession()

# Load the model configuration
with open("PATH TO THE MODEL SETTING FILE config.pickle", "rb") as f:
    conf = pickle.load(f)

# Restore the trained character-level encoder-decoder
model = CharEncoderDecoder.from_config(conf, 124, "/cpu:0")
saver = tf.train.Saver(model.train_variables)
saver.restore(sess, "PATH TO THE MODEL")

# Translate a string; returns several candidate translations
model.decode(sess, "STRING TO TRANSLATE", max_length=20)

You can also directly test the models using the test script:

#example for French
python --str "rein" --lang "fr"
# the output is a list of five candidate translations and may vary between runs:
# ['kidney', 'kidney', 'kidney', 'kidney', 'kidney']

#example for German
python --str "kopfschmerzen" --lang "de"
# output: ['headache', 'headache', 'headaing', 'headaing', 'headache']

Training new models:

If you like, you can also train your own models. In our work, we used UMLS and FreeDict for training; however, you can use any parallel data you like. The input data must be constructed as follows:

General Format:

First, you need to select the source language and the target language:

[Word in the source language] \t [One or more words in the target language with equivalent meaning, separated by tabs]

A simple example (German ==> English) can be found in CharTranslator/input_example/
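Files in this format can be read with a few lines of Python. The following is a minimal sketch (the function name is illustrative and not part of the package):

```python
def read_parallel_tsv(path):
    """Read training data in the format described above: a source word,
    a tab, then one or more equivalent target words (tab-separated)."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2 or not fields[0]:
                continue  # skip malformed or empty lines
            source, targets = fields[0], [t for t in fields[1:] if t]
            pairs.append((source, targets))
    return pairs
```

Each returned pair contains the source word and the list of its target-language equivalents.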

In case you intend to follow a setup similar to the one described in our paper, please download the sources from FreeDict. Moreover, you need to download the data from UMLS MRCONSO and convert it into the format above.
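As a rough illustration of such a conversion (not the script shipped with the package), the sketch below pairs terms of two languages via their shared CUI. It assumes MRCONSO.RRF is pipe-delimited, with the CUI in field 0, the language code (LAT) in field 1, and the term string (STR) in field 14; the exact filtering used in the paper may differ:

```python
from collections import defaultdict

def mrconso_to_pairs(path, source_lat="GER", target_lat="ENG"):
    """Group MRCONSO.RRF rows by CUI and pair every source-language term
    with all target-language terms of the same concept, yielding lines in
    the training format described above."""
    by_cui = defaultdict(lambda: {"src": set(), "tgt": set()})
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            if len(fields) < 15:
                continue  # not a valid MRCONSO row
            cui, lat, term = fields[0], fields[1], fields[14].lower()
            if lat == source_lat:
                by_cui[cui]["src"].add(term)
            elif lat == target_lat:
                by_cui[cui]["tgt"].add(term)
    # one line per source term: source \t tab-separated target terms
    for entry in by_cui.values():
        for src in sorted(entry["src"]):
            if entry["tgt"]:
                yield src + "\t" + "\t".join(sorted(entry["tgt"]))
```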

A new model is then trained with the training script:

python3 --train_data FULL_PATH_OF_TRAINING_DATA --dev_data FULL_PATH_OF_DEVELOPMENT_DATA --save_dir FULL_PATH_TO_SAVE_MODEL --encoder_size INTEGER --decoder_size INTEGER

Candidate Search & Normalization:

In the backend of our candidate search, we use Solr 6.5. A good configuration of Solr and its correct application strongly influence the results. You can find the schema we used for our experiments in the provided data. To set up the Solr system, conduct the following steps:

Start Solr:

bin/solr start -f

Create a new core:

bin/solr create -c UMLSMultiLing

Stop Solr.

Replace the schema of the new core in solr-6.5/server/solr/UMLSMultiLing/conf/ with our schema (CharTranslator/solr/schema), then restart Solr.

Now you need to load the data into your core. If you have a local UMLS installation, you can directly use the following Python script to copy the UMLS data of your target language into Solr. For our experiments, we ran the command for ENG, SPA, FRE, GER, and DUT. For English in particular, this process will take some time.

python3 --mrconso /home/rroller/data/UMLS/data/2016AB/META/MRCONSO.RRF --lang ENG --core UMLSMultiLing

The same script can be used to insert the Quaero training data into Solr (in case you want to reproduce our experiments).

python3 --mrconso no --quaero /home/rroller/data/corpora/QUAERO_FrenchMed/corpus/train/MEDLINE/ --lang QUAERO --core UMLSMultiLing

Now restart Solr again.
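As a hedged sketch of what such an insertion script might do (pysolr and the field names cui/term/lang are assumptions on our part; the actual script and the fields in our schema may differ):

```python
def mrconso_docs(path, lang="ENG"):
    """Yield one Solr document per MRCONSO row in the given language.
    MRCONSO.RRF is pipe-delimited: CUI field 0, LAT field 1, STR field 14.
    The document field names are illustrative and must match the schema
    installed in the UMLSMultiLing core."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            fields = line.rstrip("\n").split("|")
            if len(fields) < 15 or fields[1] != lang:
                continue
            yield {"id": str(i), "cui": fields[0],
                   "term": fields[14], "lang": fields[1]}

def index_mrconso(path, core="UMLSMultiLing", lang="ENG", batch=10000):
    """Push the documents into a local Solr core in batches. Any Solr
    client, or a plain HTTP POST to /solr/<core>/update, works as well."""
    import pysolr  # third-party Solr client (pip install pysolr)
    solr = pysolr.Solr("http://localhost:8983/solr/" + core,
                       always_commit=False)
    buf = []
    for doc in mrconso_docs(path, lang):
        buf.append(doc)
        if len(buf) >= batch:
            solr.add(buf)  # send documents in batches to limit memory use
            buf = []
    if buf:
        solr.add(buf)
    solr.commit()
```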

The following commands are used to access the data:

from function_set import *

# mention: the (possibly translated) string to look up
# SOLR_CORE: name of the Solr core, e.g. "UMLSMultiLing"
# target_lang: language of the indexed terms to search in
resultN_TL, resultSolr_TL = fuzzySearchSolr(mention, SOLR_CORE, target_lang, 10)
print(">", resultN_TL, resultSolr_TL)
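fuzzySearchSolr is part of the provided function_set. As a rough illustration of how a fuzzy Solr lookup can be expressed (this is not the actual implementation; the field name and the client are assumptions), Lucene's edit-distance operator ~ can be used:

```python
def build_fuzzy_query(mention, field="term", max_edits=2):
    """Build a Lucene fuzzy query string: each token may differ from the
    indexed term by up to max_edits edit operations."""
    tokens = mention.lower().split()
    return " AND ".join("{}:{}~{}".format(field, t, max_edits)
                        for t in tokens)

# Running the query against the core (pysolr as an assumed client):
# import pysolr
# solr = pysolr.Solr("http://localhost:8983/solr/UMLSMultiLing")
# results = solr.search(build_fuzzy_query("kidny"), rows=10)
```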