LIRMM: CLEF EHealth 2017 Task 1 Reproduction instructions

I. Introduction
  I.1. Team
  I.2. Methods
  I.3. Contact
II. Prerequisites
  II.1. Hardware
  II.2. An internet connection
  II.3. Java 8
  II.4. Redis
III. Reproducing the results
  III.1. Setting up the environment
  III.2. Running the experiments
    III.2.1. French Aligned
    III.2.2. French Raw
    III.2.3. English Raw
  III.3. Computing result scores
    III.3.1. French aligned
    III.3.2. French raw
    III.3.3. English raw
Appendix 1. Annotator REST API connection information
  A1.1. French Evaluation
  A1.2. English Evaluation
Appendix 2. Building the evaluation programme from source
  A2.1. Requirements
  A2.2. Steps

I. Introduction

The LIRMM system uses a dictionary-based approach through the SIFR Bioportal Annotator (http://bioportal.lirmm.fr) for the French track and the NCBO Bioportal Annotator (http://bioportal.bioontology.org) for the English track, combined with fallback heuristics.

Using the SIFR annotator is as simple as sending a request through the HTTP REST API, for example:

http://services.bioportal.lirmm.fr/annotator/?text=Absence%20de%20tumeur%20maligne&negation=true&apikey=39b6ff93-0c7c-478b-a2d7-1ad098f01a28&ontologies=WHO-ARTFRE
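
The same request can be issued from the command line, for instance with curl; this simply reproduces the example URL above and returns the annotations as JSON:

    curl -G "http://services.bioportal.lirmm.fr/annotator/" \
         --data-urlencode "text=Absence de tumeur maligne" \
         --data-urlencode "negation=true" \
         --data-urlencode "apikey=39b6ff93-0c7c-478b-a2d7-1ad098f01a28" \
         --data-urlencode "ontologies=WHO-ARTFRE"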

One can also use the UI for direct on-line annotation:

http://bioportal.lirmm.fr/annotator

The Bioportal Annotator annotates text with ontology class URIs. We annotate each line of the corpora with CIM-10- and ICD10-based ontologies and extract the ICD10/CIM-10 codes from the URIs.

The LIRMM system aims at evaluating the Bioportal Annotator's concept recognition performance alone, without any disambiguation. Recall is therefore the main objective: a post-recognition disambiguation step increases precision by selecting a subset of the recognised concepts, at the cost of lower recall.

Reference:

(Temporary French publication) Clement Jonquet, Amina Annane, Khedidja Bouarech, Vincent Emonet & Soumia Melzi. SIFR BioPortal: Un portail ouvert et générique d'ontologies et de terminologies biomédicales françaises au service de l'annotation sémantique. In 16th Journées Francophones d'Informatique Médicale (JFIM'16), Geneva, Switzerland, July 2016. pp. 16.

I.1. Team

Two post-doctoral researchers at LIRMM, Montpellier:

An engineer (LIRMM, Montpellier):

Supervised by an Assistant Professor (LIRMM, Montpellier and Stanford Center for Biomedical Informatics Research) in the context of the ANR PractikPharma project around the SIFR BioPortal platform:

I.2. Methods

  1. FR Run 1 (Aligned and Raw): Annotation through the SIFR Bioportal Annotator with the CIM-10 French ontology (originating from CISMeF), combined with a custom-built SKOS vocabulary (CIM-10DC) constructed from the provided set of dictionaries and the training corpus. The vocabulary was generated with a heuristic whereby labels corresponding to multiple codes are assigned to the most frequent code only; the code distribution was estimated from the training corpus. We wrote a Java programme that reads the corpora and calls the Annotator API to annotate the text with ontology classes (for CIM-10, the last component of the class URI is the code; see the sketch after this list).
  2. FR Run 2 (Aligned and Raw): A fallback strategy that starts from the result file of Run 1 and, for each line without any annotation, takes the annotations from a second run in which the custom SKOS vocabulary was built without the most-frequent-code heuristic (higher recall, slightly lower precision). This is, in essence, a late-fusion technique that aims at increasing recall while keeping precision very similar.
  3. EN Run 1 (Raw): Annotation through the SIFR Bioportal Annotator with a custom-built SKOS vocabulary (ICD10CDC) constructed from the provided American dictionary, generated with the same most-frequent-code heuristic (the code distribution was again estimated from the training corpus). The same Java programme annotates the text with ontology classes (for ICD10, the last component of the class URI is the code).
  4. EN Run 2 (Raw): Same as EN Run 1, but with the ICD10CDC vocabulary combined with OWL versions of ICD10 and ICD10CM (extracted from UMLS).
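
To illustrate the code extraction step mentioned in the runs above, here is a minimal shell sketch. The class URI is a hypothetical example; the only property relied upon is the one stated above, namely that the code is the last component of the URI:

    # Hypothetical class URI returned by the Annotator:
    uri="http://purl.example.org/CIM-10/I10"
    # The CIM-10/ICD10 code is the last URI component (after the last / or #):
    code="${uri##*[/#]}"
    echo "$code"    # prints: I10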

I.3. Contact

The main contact for the reproduction track for the LIRMM system is Andon Tchechmedjiev; you may get in touch with him by email at andon.tchechmedjiev@lirmm.fr.

If you wish, you may also submit your queries on the issue tracker of the evaluation programme on GitHub:

https://github.com/twktheainur/bpannotatoreval/issues

II. Prerequisites

II.1. Hardware

You need a modern machine with at least 4 GB of RAM, 3 GB of which must be free for the evaluation programme and the Redis cache server. The programme will run fine on most 2-core systems with a relatively recent CPU (up to roughly 5-6 years old), as the CPU requirements are low.

II.2. An internet connection

The evaluation programme communicates with the SIFR Bioportal Annotator and the NCBO Bioportal Annotator through a REST API and requires an active internet connection of reasonable speed to accommodate the data exchange.

Warning: If your internet connection is behind a proxy that requires authentication, the programme will not work even with the appropriate system configuration, because the default behaviour of the JDK is to ignore the system properties related to proxy authentication. If you find yourself in that situation, please get in touch with us so that we may produce a version of the evaluation programme that works with your proxy settings.

II.3. Java 8

The evaluation programme is written in Java and requires the Java 8 JRE.

Make sure that the bin directory of the JRE is in the system PATH variable (Path on Windows systems).

II.4. Redis

The evaluation programme uses a Redis cache server to minimise redundant network activity and to allow the annotation to resume in case of network failure. You need to install Redis on your machine.
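
For example, on a Debian/Ubuntu system (an assumption; on other platforms, use your package manager or the instructions at https://redis.io):

    sudo apt-get install redis-server
    redis-server --daemonize yes
    redis-cli ping    # should reply: PONG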

III. Reproducing the results

III.1. Setting up the environment

We have prepared an archive with the test corpus and the jar needed to run the LIRMM system. Please use the corpus from the archive, especially for French, as we needed to correct errors on lines where the text contains semicolons: the raw text was properly escaped in the American dataset but not in the French datasets, which required a painstaking manual correction with the help of regular expressions.

  1. Download the archive and decompress it

    http://andon.tchechmedjiev.eu/files/LIRMMSystemReproduction.zip

  2. Open a terminal and go into the decompressed directory; it should be named LIRMMSystemReproduction

  3. Make sure you can run Java: java -version should run and show 1.8.X as the version number.

The corpus directory contains the test corpora for the three experiments (French aligned, French raw and English raw).

The eval directory contains the aligned and raw scorers.

III.2. Running the experiments

Reminder: Make sure redis-server is running!

The syntax of the evaluation programme is the following:

 
java -cp bpeval.jar org.pratikpharma.cli.CLEFEHealth2017Task1Evaluator [aligned|raw] /path/to/corpus result.csv cachePrefix annotatorURL apikey [NONE|MFC|CUTOFF|DISAMBIGUATE] [0-1].[0-9]+ ONTOLOGY1 ONTOLOGY2 ... ONTOLOGYN
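
As a reading aid, this is how we interpret each argument against the description in section I.2 (in particular, MFC stands for the most-frequent-code heuristic, and the decimal value is the threshold used by the chosen heuristic):

    # [aligned|raw]                    format of the input corpus
    # /path/to/corpus                  input corpus file
    # result.csv                       output file for the predicted codes
    # cachePrefix                      key prefix for the Redis cache
    # annotatorURL                     Annotator REST endpoint (see Appendix 1)
    # apikey                           API key for that endpoint (see Appendix 1)
    # [NONE|MFC|CUTOFF|DISAMBIGUATE]   post-annotation heuristic
    # [0-1].[0-9]+                     decimal threshold for the heuristic
    # ONTOLOGY1 ... ONTOLOGYN          BioPortal ontology acronyms (see Appendix 1)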

For all the runs below, if a network connection error occurs or the server goes down temporarily, you can simply run the same command again: the annotation will resume from the cache at the point where it was interrupted.

Depending on your internet connection speed and the load of the servers, the annotation may take up to several hours. To streamline the process, you may run the French and English experiments in parallel, so that the replication does not take longer than the allotted 8 hours.

III.2.1. French Aligned
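
A minimal sketch of a Run 1 invocation: the annotator URL, API key and ontology acronyms are the real ones from Appendix 1, but the corpus path, result file name, cache prefix, heuristic flag (NONE) and threshold (0.0) are assumptions to adapt to your setup:

    java -cp bpeval.jar org.pratikpharma.cli.CLEFEHealth2017Task1Evaluator \
        aligned corpus/path/to/french_aligned_test.csv \
        result_fr_aligned_run1.csv fr_aligned_run1 \
        http://services.bioportal.lirmm.fr/annotator \
        39b6ff93-0c7c-478b-a2d7-1ad098f01a28 \
        NONE 0.0 CIM-10 CIM-10DC-ALLMFC

For Run 2 (the fallback described in section I.2), presumably the same command is run with CIM-10DC-ALL (the vocabulary built without the most-frequent-code heuristic) in place of CIM-10DC-ALLMFC, producing the higher-recall annotation set that is fused with the Run 1 results.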


III.2.2. French Raw
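
A sketch of a Run 1 invocation in raw mode, under the same assumptions as for the aligned run (only the mode and corpus path differ):

    java -cp bpeval.jar org.pratikpharma.cli.CLEFEHealth2017Task1Evaluator \
        raw corpus/path/to/french_raw_test.csv \
        result_fr_raw_run1.csv fr_raw_run1 \
        http://services.bioportal.lirmm.fr/annotator \
        39b6ff93-0c7c-478b-a2d7-1ad098f01a28 \
        NONE 0.0 CIM-10 CIM-10DC-ALLMFC

As above, Run 2 presumably swaps CIM-10DC-ALLMFC for CIM-10DC-ALL.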


III.2.3. English Raw
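
A sketch of the English Run 1 invocation, under the same assumptions (hypothetical paths, names, heuristic flag and threshold); the endpoint, API key and ontology acronyms are those from Appendix 1, and the ontology lists follow section I.2:

    java -cp bpeval.jar org.pratikpharma.cli.CLEFEHealth2017Task1Evaluator \
        raw corpus/path/to/english_raw_test.csv \
        result_en_raw_run1.csv en_raw_run1 \
        http://services.bioportal.lirmm.fr/ncbo_annotatorplus \
        9c9d2054-33f0-4d1f-b545-87255257b56c \
        NONE 0.0 ICD10CDC

For Run 2, ICD10 and ICD10CM are added to the ontology list (ICD10CDC ICD10 ICD10CM), per section I.2.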


III.3. Computing result scores

III.3.1. French aligned

The evaluation gold standard for the test dataset had not been released as of the submission to the replication task, so it is not included in our archive. Please replace /path/to/test/goldstandardcorpus.csv with the appropriate path once you have obtained the gold standard.

III.3.2. French raw

III.3.3. English raw

Appendix 1. Annotator REST API connection information

A1.1. French Evaluation

API key for the French evaluation:

39b6ff93-0c7c-478b-a2d7-1ad098f01a28

Annotator URL for the French evaluation:

http://services.bioportal.lirmm.fr/annotator

Ontology acronyms used in the evaluation:

CIM-10
CIM-10DC-ALL
CIM-10DC-ALLMFC

A1.2. English Evaluation

API key for the English evaluation:

9c9d2054-33f0-4d1f-b545-87255257b56c

Annotator URL for the English evaluation:

http://services.bioportal.lirmm.fr/ncbo_annotatorplus

Ontology acronyms used in the evaluation:

ICD10
ICD10CM
ICD10CDC

Appendix 2. Building the evaluation programme from source

The project can be found on GitHub: https://github.com/twktheainur/bpannotatoreval

A2.1. Requirements

Building from source requires the Java 8 JDK (a JRE is not sufficient for compilation), Git and Apache Maven.

A2.2. Steps

  1. Clone the repository (you need a GitHub account to do so):

    git clone https://github.com/twktheainur/bpannotatoreval.git

  2. Go into the directory:

    cd bpannotatoreval

  3. Build with Maven:

    mvn clean install assembly:assembly

The bpeval.jar file will be created in the target directory; it is interchangeable with the jar provided in the replication archive.