LIRMM: CLEF eHealth 2017 Task 1 Reproduction instructions

I. Introduction
I.1. Team
I.2. Methods
I.3. Contact
II. Prerequisites
II.1. Hardware
II.2. An internet connection
II.3. Java 8
II.4. Redis
III. Reproducing the results
III.1. Setting up the environment
III.2. Running the experiments
III.2.1. French Aligned
III.2.2. French Raw
III.2.3. English Raw
III.3. Computing result scores
III.3.1. French aligned
III.3.2. French raw
III.3.3. English raw
Appendix 1. Annotator REST API connection information
A1.1. French Evaluation
A1.2. English Evaluation
Appendix 2. Building the evaluation programme from source
A2.1. Requirements
A2.2. Steps
The LIRMM system uses a dictionary-based approach through the SIFR Bioportal Annotator (http://bioportal.lirmm.fr) for the French track and the NCBO Bioportal Annotator (http://bioportal.bioontology.org) for the English track, combined with fallback heuristics.
Using the SIFR annotator is as simple as sending a request through the HTTP REST API, for example:
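A minimal sketch with curl, using the French annotator URL and API key from Appendix 1 (the text is an arbitrary French example; the text, ontologies and apikey parameters follow the standard Bioportal Annotator REST interface):
$ curl -G "http://services.bioportal.lirmm.fr/annotator" \
    --data-urlencode "text=insuffisance cardiaque" \
    --data-urlencode "ontologies=CIM-10" \
    --data-urlencode "apikey=39b6ff93-0c7c-478b-a2d7-1ad098f01a28"
The response is a list of annotations, each carrying the URI of the matched ontology class.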
One can also use the UI for direct on-line annotation:
http://bioportal.lirmm.fr/annotator
The Bioportal Annotator annotates text with ontology class URIs. We annotate each line of the corpora with CIM-10 and ICD10 based ontologies and extract the ICD10/CIM-10 codes from the URIs.
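The code extraction itself is a simple string operation; a one-line sketch, assuming that the code is the last path segment or fragment of the class URI (the URI below is only illustrative):
$ echo "http://purl.bioontology.org/ontology/ICD10/I10" | sed 's@.*[/#]@@'
I10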
The LIRMM system aims at evaluating the Bioportal Annotator's concept recognition performance alone, without any disambiguation. As such, a higher recall is the main objective, since a post-recognition disambiguation process increases precision by selecting a subset of the recognized concepts, at the cost of a lower recall.
Reference:
(Temporary French publication) Clement Jonquet, Amina Annane, Khedidja Bouarech, Vincent Emonet & Soumia Melzi. SIFR BioPortal: Un portail ouvert et générique d'ontologies et de terminologies biomédicales françaises au service de l'annotation sémantique. In 16th Journées Francophones d'Informatique Médicale (JFIM'16), Geneva, Switzerland, July 2016. pp. 16.
Two post-doctoral researchers at LIRMM, Montpellier:
An engineer (LIRMM, Montpellier):
Supervised by an Assistant Professor (LIRMM, Montpellier and Stanford Center for Biomedical Informatics Research) in the context of the ANR PractikPharma project around the SIFR BioPortal platform:
The main contact for the reproduction track for the LIRMM system is Andon Tchechmedjiev, you may get in touch with him by email at andon.tchechmedjiev@lirmm.fr.
If you wish, you may also submit your queries on the issue tracker of the evaluation programme on GitHub:
https://github.com/twktheainur/bpannotatoreval/issues
You need a modern machine with at least 4 GB of RAM, with 3 GB free just for the execution of the evaluation programme and the Redis cache server. The programme will run fine on most 2-core systems with a relatively recent CPU (around 5-6 years old or newer), as the CPU requirements are low.
The evaluation programme communicates with the SIFR Bioportal Annotator and the NCBO Bioportal Annotator through a REST API and requires an active internet connection of reasonable speed to accommodate the data exchange.
Warning: if your internet connection is behind a proxy that requires authentication, the programme will not work even if you have the appropriate system configuration, because the default behaviour of the JDK is to ignore the system properties related to proxy authentication. If you find yourself in that situation, please get in touch with us so that we may produce a version of the evaluation programme that works with your proxy settings.
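If your proxy does not require authentication, the standard JVM proxy system properties should be enough; you can prepend them to any of the java commands in section III.2 (the host proxy.example.org and port 3128 below are hypothetical placeholders):
$ java -Dhttp.proxyHost=proxy.example.org -Dhttp.proxyPort=3128 \
       -Dhttps.proxyHost=proxy.example.org -Dhttps.proxyPort=3128 \
       -cp bpeval.jar org.pratikpharma.cli.CLEFEHealth2017Task1Evaluator ...
where ... stands for the arguments of the run you are reproducing.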
The evaluation programme is written in Java and requires that you install the Java 8 JRE.
Make sure that the bin directory of the JRE is in the system PATH variable (Path on Windows systems).
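For example, on Linux or macOS (the JRE location below is a hypothetical placeholder; adjust it to your installation):
$ export PATH="/usr/lib/jvm/java-8-oracle/jre/bin:$PATH"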
macOS installation instructions:
With the Oracle installer https://docs.oracle.com/javase/9/install/installation-jdk-and-jre-macos.htm
With Homebrew:
$ brew update
$ brew tap caskroom/cask
$ brew install Caskroom/cask/java
Linux installation instructions:
Official Oracle instructions:
https://docs.oracle.com/javase/9/install/installation-jdk-and-jre-linux-platforms.htm
On Debian:
On Ubuntu:
https://www.unixmen.com/installing-java-jrejdk-ubuntu-16-04/
On Fedora/CentOs:
On ArchLinux:
Windows installation instructions:
Official Oracle documentation: https://docs.oracle.com/javase/9/install/installation-jdk-and-jre-microsoft-windows-platforms.htm
The evaluation programme uses a Redis cache server to minimise redundant network activity and to allow the annotation to resume in case of network failure. You need to install Redis on your machine:
On macOS, you can use Homebrew to install Redis.
First install Homebrew if you don't already have it. Open a terminal and run:
$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Install redis:
$ brew install redis
Run redis by opening a terminal and typing:
$ redis-server
Redis will run in the foreground; please leave the window open unless you want to stop Redis. You can interrupt the execution with Ctrl + C. Redis will create a database file named dump.rdb; you may delete it after you have finished with the reproduction. Please use Redis's default port: 6379.
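To check that the server is reachable on the default port, you can run the following in another terminal (redis-cli ships with the standard Redis install):
$ redis-cli ping
PONG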
On Linux:
Debian or Ubuntu:
https://www.digitalocean.com/community/tutorials/how-to-install-and-use-redis
Fedora/CentOs:
ArchLinux:
On other distributions, the instructions are similar to the Debian or Fedora instructions, except for the installation of the dependencies: make sure you have build tools installed (gcc toolchain with automake/autoheader) as well as tcl.
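A rough sketch of such a source build (the stable tarball URL below is the usual one but may change; make test requires tcl):
$ curl -O http://download.redis.io/redis-stable.tar.gz
$ tar xzf redis-stable.tar.gz
$ cd redis-stable
$ make
$ make test
$ src/redis-server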
On Windows:
We have prepared an archive with the test corpus and the jar needed to run the LIRMM system. Please use the corpus from the archive, especially for the French corpus, as we needed to correct errors in lines where the text contains semicolons. The raw text was properly escaped in the American dataset but not in the French datasets, which required a painstaking manual correction with the help of regular expressions.
Download the archive and decompress it
http://andon.tchechmedjiev.eu/files/LIRMMSystemReproduction.zip
Open a terminal and go into the decompressed directory. It should be named LIRMMSystemReproduction.
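Putting these steps together on Linux or macOS (assuming curl and unzip are available):
$ curl -LO http://andon.tchechmedjiev.eu/files/LIRMMSystemReproduction.zip
$ unzip LIRMMSystemReproduction.zip
$ cd LIRMMSystemReproduction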
Make sure you can run Java:
$ java -version
It should run and show 1.8.x as the version number.
The corpus directory contains the following corpora:
The eval directory contains the aligned and raw scorers:
Reminder: Make sure redis-server is running!
The syntax of the evaluation programme is the following:
java -cp bpeval.jar org.pratikpharma.cli.CLEFEHealth2017Task1Evaluator [aligned|raw] /path/to/corpus result.csv cachePrefix annotatorURL apikey [NONE|MFC|CUTOFF|DISAMBIGUATE] [0-1].[0-9]+ ONTOLOGY1 ONTOLOGY2 ... ONTOLOGYN
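The positional arguments, as we use them in the runs below, are:
  [aligned|raw]                    submission format to produce
  /path/to/corpus                  input corpus CSV (from the corpus directory)
  result.csv                       output submission file
  cachePrefix                      prefix for the Redis cache keys (used to resume interrupted runs)
  annotatorURL                     annotator REST endpoint (see Appendix 1)
  apikey                           API key for that endpoint (see Appendix 1)
  [NONE|MFC|CUTOFF|DISAMBIGUATE]   post-recognition heuristic (NONE in all of our runs)
  [0-1].[0-9]+                     score threshold (0.0 in all of our runs)
  ONTOLOGY1 ... ONTOLOGYN          acronyms of the ontologies to annotate with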
For all the runs below, if any network connection errors occur or if the server goes down temporarily for some reason, you can simply run the same command again, the annotation will resume from the cache where it had been interrupted.
Depending on your internet connection speed and the load of the servers, the annotation may take up to several hours. To streamline the process, you may perform all the French and English runs at the same time, so that the replication does not take longer than the allotted 8 hours.
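For example, each annotation command can be launched in the background from a single terminal with its output logged (shown here for French Run 1 aligned, exactly as given below); the fallback merge steps must still be run after the corresponding annotation runs have finished:
$ nohup java -cp bpeval.jar org.pratikpharma.cli.CLEFEHealth2017Task1Evaluator aligned corpus/FR/aligned/AlignedCauses_2014test.csv FR_aligned_run1.csv FR_aligned_run1 http://services.bioportal.lirmm.fr/annotator 39b6ff93-0c7c-478b-a2d7-1ad098f01a28 NONE 0.0 CIM-10DC-ALLMFC CIM-10 > FR_aligned_run1.log 2>&1 &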
French Run 1 (aligned):
java -cp bpeval.jar org.pratikpharma.cli.CLEFEHealth2017Task1Evaluator aligned corpus/FR/aligned/AlignedCauses_2014test.csv FR_aligned_run1.csv FR_aligned_run1 http://services.bioportal.lirmm.fr/annotator 39b6ff93-0c7c-478b-a2d7-1ad098f01a28 NONE 0.0 CIM-10DC-ALLMFC CIM-10
French Run 2 (aligned):
First you need to run the evaluation with another ontology (without filtering heuristics) and then run the fallback programme to merge the annotations from Run 1 with this new run.
java -cp bpeval.jar org.pratikpharma.cli.CLEFEHealth2017Task1Evaluator aligned corpus/FR/aligned/AlignedCauses_2014test.csv FR_aligned_run_all.csv FR_aligned_run_all http://services.bioportal.lirmm.fr/annotator 39b6ff93-0c7c-478b-a2d7-1ad098f01a28 NONE 0.0 CIM-10DC-ALL CIM-10
java -cp bpeval.jar org.pratikpharma.cli.ClefEHealth2017T1ResultFallbackAligned FR_aligned_run1.csv FR_aligned_run_all.csv FR_aligned_run2.csv
French Run 1 (raw):
java -cp bpeval.jar org.pratikpharma.cli.CLEFEHealth2017Task1Evaluator raw corpus/FR/raw/CausesBrutes_FR_test2014.csv FR_raw_run1.csv FR_aligned_run1 http://services.bioportal.lirmm.fr/annotator 39b6ff93-0c7c-478b-a2d7-1ad098f01a28 NONE 0.0 CIM-10DC-ALLMFC CIM-10
Please note that the cache key used is the same as for the aligned evaluation, which is not an error. Given that there is no difference in the text itself and that our system only uses the RawText field, only the number of fields in the output changes between the two tasks. Thus, by using the same cache key, we load all the same annotations from the cache and just write the result in the appropriate format, which takes mere seconds as opposed to more than an hour.
French Run 2 (raw):
First you need to run the evaluation with another ontology (without filtering heuristics during dictionary construction) and then run the fallback programme to merge the annotations from Run 1 with this new run.
java -cp bpeval.jar org.pratikpharma.cli.CLEFEHealth2017Task1Evaluator raw corpus/FR/raw/CausesBrutes_FR_test2014.csv FR_raw_run_all.csv FR_aligned_run_all http://services.bioportal.lirmm.fr/annotator 39b6ff93-0c7c-478b-a2d7-1ad098f01a28 NONE 0.0 CIM-10DC-ALL CIM-10
java -cp bpeval.jar org.pratikpharma.cli.ClefEHealth2017T1ResultFallbackRaw FR_raw_run1.csv FR_raw_run_all.csv FR_raw_run2.csv
English Run 1 (raw):
java -cp bpeval.jar org.pratikpharma.cli.CLEFEHealth2017Task1Evaluator raw corpus/EN/raw/CausesBrutes_EN_test.csv EN_raw_run1.csv EN_raw_run1 http://services.bioportal.lirmm.fr/ncbo_annotatorplus 9c9d2054-33f0-4d1f-b545-87255257b56c NONE 0.0 ICD10DCD
English Run 2 (raw):
First you need to run the evaluation with another ontology (without filtering heuristics during dictionary construction) and then run the fallback programme to merge the annotations from Run 1 with this new run.
java -cp bpeval.jar org.pratikpharma.cli.CLEFEHealth2017Task1Evaluator raw corpus/EN/raw/CausesBrutes_EN_test.csv EN_raw_run_all.csv EN_raw_run_all http://services.bioportal.lirmm.fr/ncbo_annotatorplus 9c9d2054-33f0-4d1f-b545-87255257b56c NONE 0.0 ICD10CDC ICD10CM ICD10
java -cp bpeval.jar org.pratikpharma.cli.ClefEHealth2017T1ResultFallbackRaw EN_raw_run1.csv EN_raw_run_all.csv EN_raw_run2.csv
The evaluation gold standard for the test dataset had not been released as of the submission for the replication task; as such, it is not included in our archive. Please replace /path/to/test/goldstandardcorpus.csv in the commands below with the path to the gold standard file for the appropriate task.
Run 1:
$ perl ./eval/clefehealth2017Task1eval.pl /path/to/test/goldstandardcorpus.csv FR_aligned_run1.csv
Run 2:
$ perl ./eval/clefehealth2017Task1eval.pl /path/to/test/goldstandardcorpus.csv FR_aligned_run2.csv
Run 1:
$ perl eval/clefehealthTask12017_plainCertifeval.pl /path/to/test/goldstandardcorpus.csv FR_raw_run1.csv
Run 2:
$ perl eval/clefehealthTask12017_plainCertifeval.pl /path/to/test/goldstandardcorpus.csv FR_raw_run2.csv
Run 1:
$ perl eval/clefehealthTask12017_plainCertifeval.pl /path/to/test/goldstandardcorpus.csv EN_raw_run1.csv
Run 2:
$ perl eval/clefehealthTask12017_plainCertifeval.pl /path/to/test/goldstandardcorpus.csv EN_raw_run2.csv
API key for the French evaluation:
39b6ff93-0c7c-478b-a2d7-1ad098f01a28
Annotator URL for the French evaluation:
http://services.bioportal.lirmm.fr/annotator
Ontology acronyms used in the evaluation:
CIM-10 CIM-10DC-ALL CIM-10DC-ALLMFC
API key for the English evaluation:
9c9d2054-33f0-4d1f-b545-87255257b56c
Annotator URL for the English evaluation:
http://services.bioportal.lirmm.fr/ncbo_annotatorplus
Ontology acronyms used in the evaluation:
ICD10 ICD10CM ICD10CDC
The project can be found on GitHub: https://github.com/twktheainur/bpannotatoreval
You need to have the Java 1.8 JDK installed.
You need to install Apache Maven:
Clone the repository (you need a GitHub account to do so):
git clone https://github.com/twktheainur/bpannotatoreval.git
Go into the directory:
cd bpannotatoreval
Build with maven:
mvn clean install assembly:assembly
The bpeval.jar file will be created in the target directory; it is interchangeable with the jar provided in the replication archive.
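For example, to use the freshly built jar in place of the one from the archive (the destination path is a placeholder for wherever you decompressed it):
$ cp target/bpeval.jar /path/to/LIRMMSystemReproduction/bpeval.jar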