MTI ML (Machine Learning Package)

NOTES

The MTI ML package is protected under the MetaMap Terms and Conditions. Please review prior to downloading the MTI ML package.

The MTI ML package provides machine learning algorithms optimized for large text categorization tasks and can combine several text categorization solutions. Compared to existing approaches, the package offers three advantages: 1) speed, 2) the ability to work with a large number of categorization problems, and 3) the ability to compare several text categorization tools based on meta-learning. This page describes how to download, install, and run MTI ML. An example data set is provided to verify the installation of the tool. More detailed instructions on using the tool are available here.

The latest changes in the MTI ML package can be found in the release notes here.

Download

The main components of the MTI ML distribution can be downloaded via the single MTI_ML.tar.gz link below. MTI ML also requires a third-party package, monq-1.1.1.jar, which is available via the link below.

The monq-1.1.1.jar package is a third-party open source development package that was originally available from berliOS. MTI ML uses it to parse XML and to provide server capabilities. We have preserved a static copy of the monq-1.1.1.jar file locally.

For best results, download and install the MTI ML distribution file MTI_ML.tar.gz and then download and install the monq-1.1.1.jar package in the new MTI_ML directory that was created by the MTI ML distribution file.

MTI_ML.tar.gz (49 MB)
monq-1.1.1.jar (221 KB) [originally from Third Party berliOS]

JAR files included in the MTI_ML distribution:

mti_prod.jar: contains the core of the categorization algorithms
utils.jar: contains tools to perform sorting

Installation

MTI ML has been compiled and verified to work with Java 1.6. If you need to install Java or update your current version, follow this link: http://www.java.com. Make sure that the directory containing the java program is in the PATH environment variable.
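A minimal sketch for checking the PATH requirement from a Bourne-style shell on Linux or Mac OS. The helper name on_path is hypothetical and is not part of MTI ML; it relies only on the standard command -v builtin.

```shell
# Hypothetical helper: succeeds when the named program can be found via PATH.
on_path() {
    command -v "$1" >/dev/null 2>&1
}

if on_path java; then
    # java -version writes to standard error, so merge it into standard output.
    java -version 2>&1 | head -n 1
else
    echo "java was not found on PATH" >&2
fi
```

If the second message appears, add the directory that contains the java program to PATH before continuing.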

Move the downloaded files into the directory where you want to install MTI ML. Uncompressing and untarring the MTI ML distribution file creates a subdirectory called "MTI_ML"; this directory is referred to as <parent_directory> throughout the rest of the instructions. For example, if you create a directory "Project" and install MTI ML in Project, then <parent_directory> is <path to Project>/Project/MTI_ML, or <path to Project>\Project\MTI_ML under Windows.

# Windows:

WinRAR and WinZip can be used to uncompress and untar the files.

# Linux and Mac OS (from Bourne/Bash Shell):

gunzip -c MTI_ML.tar.gz | tar -xf -

NOTE: Now, move the downloaded monq-1.1.1.jar file into the newly created MTI_ML directory before proceeding to Training.

cd MTI_ML

# Windows:

move ..\monq-1.1.1.jar .

# Linux and Mac OS (from Bourne/Bash Shell):

mv ../monq-1.1.1.jar .
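Before continuing, it can help to confirm that the three jar files named in this guide are all in the MTI_ML directory. The following is a sketch for Bourne-style shells only; the function name missing_jars is hypothetical and is not shipped with MTI ML.

```shell
# Hypothetical helper: print the name of each expected jar file that is
# missing from the given directory. Prints nothing when all are present.
missing_jars() {
    dir=$1
    for jar in mti_prod.jar utils.jar monq-1.1.1.jar; do
        [ -f "$dir/$jar" ] || echo "$jar"
    done
}

# Usage: run from inside the MTI_ML directory.
missing_jars .
```

No output means that all three files were found.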

The various MTI ML jar files have to be added to the CLASSPATH environment variable or configured directly using the -cp parameter of the Java Virtual Machine (JVM).

Training and testing classifiers using MTI ML

The following section provides a tutorial on how to use MTI ML. All of the tutorial examples are provided for the three supported operating systems: Windows (XP & 7), Linux, and Mac OS. Before you start, please download the tutorial environment by saving all of the following files to your work area. Between these instructions and the tutorial environment, you should be able to recreate the entire MTI ML sample run, from training the machine learning algorithms to evaluating the results.

The tutorial environment consists of sample training and test data sets, a sample configuration file, and a benchmark file with the gold standard annotations for the data sets to evaluate the results.

Please Note: The MEDLINE citations in the training and test data sets and the results in the benchmark file represent a static view of the MEDLINE database at the time the data was created. No attempt has been made to keep the data up-to-date.

Training, testing, and evaluation files included in the MTI_ML distribution:

citations.train.xml.gz: training citations
citations.test.xml.gz: test citations
configuration.txt: configuration file used during training. This file provides the instructions to learn models to categorize citations related to the Humans, Male and Female MeSH headings.
benchmark.test: Gold Standard file used to evaluate the performance of the algorithms on the test set

Using MTI ML on Female, Humans, and Male MeSH Headings

Note: In Windows, the files citations.train.xml.gz and citations.test.xml.gz will need to be uncompressed before proceeding. WinRAR and WinZip can be used to uncompress the files.

Note: Please substitute the full path of the installation directory whenever "<parent_directory>" is used in the instructions.

Move to the Parent Directory

# Windows. Open a Windows command prompt. Go to <parent_directory>.

Move to the drive (e.g. C:) where the <parent_directory> is located

[Drive]:

cd <parent_directory>

# Linux and Mac OS. Open a terminal. Go to <parent_directory>.

cd <parent_directory>

Setting the CLASSPATH environment variable

# Windows:

set CLASSPATH="<parent_directory>\monq-1.1.1.jar;<parent_directory>\utils.jar;<parent_directory>\mti_prod.jar"

# Linux and Mac OS:

* in C Shell (csh or tcsh)

setenv CLASSPATH <parent_directory>/monq-1.1.1.jar:<parent_directory>/utils.jar:<parent_directory>/mti_prod.jar

* in Bourne Again Shell (bash)

export CLASSPATH=<parent_directory>/monq-1.1.1.jar:<parent_directory>/utils.jar:<parent_directory>/mti_prod.jar

* Bourne Shell (sh)

CLASSPATH=<parent_directory>/monq-1.1.1.jar:<parent_directory>/utils.jar:<parent_directory>/mti_prod.jar; export CLASSPATH
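For Bourne-style shells, the value can also be assembled from the directory name so that the path is typed only once. This is a sketch under that assumption; the function name mti_classpath is hypothetical, and the jar order simply mirrors the commands above.

```shell
# Hypothetical helper: join the three jar paths with ':' (the separator used
# on Linux and Mac OS) for the directory passed as the first argument.
mti_classpath() {
    dir=$1
    printf '%s/monq-1.1.1.jar:%s/utils.jar:%s/mti_prod.jar\n' "$dir" "$dir" "$dir"
}

# Usage: run from inside <parent_directory>.
CLASSPATH=$(mti_classpath "$(pwd)")
export CLASSPATH
echo "$CLASSPATH"
```

The echo at the end prints the resulting value so that it can be compared against the paths shown above.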

Training

To train the classifier, run the training tool. It generates a dictionary file stored in a trie structure (trie.gz) and a set of models (classifiers.gz) based on the definitions in the configuration.txt file.

The file configuration.txt contains the details of the training. In this case, we are training models for the Humans, Male, and Female MeSH headings.

Details of the training are sent to the standard output. In the example, this is redirected to out.txt and can be used to follow the training progress and understand the generated model.

The training will take several minutes. Error messages are redirected to out.log; check this file to ensure that there were no errors during training.

# Windows:

type citations.train.xml | java -cp %CLASSPATH% -Xmx1G -Xms1G -ss6000k gov.nih.nlm.nls.mti.trainer.OVATrainer gov.nih.nlm.nls.mti.textprocessors.MEDLINEXMLTextProcessor "" gov.nih.nlm.nls.mti.featuresextractors.BinaryFeatureExtractor "-l -n -c -f1" configuration.txt trie.gz classifiers.gz 2> out.log > out.txt

# Linux and Mac OS (from Bourne/Bash Shell):

sh

gunzip -c citations.train.xml.gz | java -cp $CLASSPATH -Xmx1G -Xms1G -ss6000k gov.nih.nlm.nls.mti.trainer.OVATrainer gov.nih.nlm.nls.mti.textprocessors.MEDLINEXMLTextProcessor "" gov.nih.nlm.nls.mti.featuresextractors.BinaryFeatureExtractor "-l -n -c -f1" configuration.txt trie.gz classifiers.gz 2> out.log > out.txt
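The check of out.log can be partly automated in a Bourne-style shell. The format of MTI ML error messages is not described on this page, so the pattern below (any line mentioning "exception" or "error", in any letter case) is an assumption, and the function name log_looks_clean is hypothetical. Treat a clean result as a hint and still read the log if the results look wrong.

```shell
# Hypothetical helper: succeeds when the log has no line that mentions
# "exception" or "error". The pattern is an assumption, not an MTI ML format.
log_looks_clean() {
    [ -r "$1" ] || return 2
    ! grep -i -E -q 'exception|error' "$1"
}

if [ -f out.log ]; then
    if log_looks_clean out.log; then
        echo "out.log: no exception or error lines found"
    else
        # Show the suspicious lines with their line numbers.
        grep -i -E -n 'exception|error' out.log
    fi
fi
```

The same function can be applied to annotation.log after the testing step.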

Testing

Given the trained model, we can then annotate a new set of citations. In the examples below, the outcome is stored in the annotation.txt file.

Annotating will take several minutes. Check the file annotation.log to ensure that there were no errors during annotation.

# Windows:

type citations.test.xml | java -ss6000k -cp %CLASSPATH% gov.nih.nlm.nls.mti.annotator.OVAAnnotator gov.nih.nlm.nls.mti.textprocessors.MEDLINEXMLTextProcessor "" gov.nih.nlm.nls.mti.featuresextractors.BinaryFeatureExtractor "-l -n -c" trie.gz classifiers.gz > annotation.txt 2> annotation.log

# Linux and Mac OS (from Bourne/Bash Shell):

gunzip -c citations.test.xml.gz | java -ss6000k -cp $CLASSPATH gov.nih.nlm.nls.mti.annotator.OVAAnnotator gov.nih.nlm.nls.mti.textprocessors.MEDLINEXMLTextProcessor "" gov.nih.nlm.nls.mti.featuresextractors.BinaryFeatureExtractor "-l -n -c" trie.gz classifiers.gz > annotation.txt 2> annotation.log

Evaluation

The annotation can be evaluated using the following command. The file benchmark.test contains the MEDLINE MeSH indexing for each of the citations in the test set and is used as the Gold Standard. The output is stored in the file benchmark.txt.

# Windows:

type annotation.txt | java -cp %CLASSPATH% gov.nih.nlm.nls.mti.evaluator.Evaluator benchmark.test > benchmark.txt

# Linux and Mac OS (from Bourne/Bash Shell):

cat annotation.txt | java -cp $CLASSPATH gov.nih.nlm.nls.mti.evaluator.Evaluator benchmark.test > benchmark.txt

grep "^Female|" benchmark.txt
grep "^Humans|" benchmark.txt
grep "^Male|" benchmark.txt

The evaluation file benchmark.txt contains results for all the MeSH headings. Each line shows the result for a single MeSH heading, with fields separated by the pipe symbol. The fields are, in order: the MeSH heading name, the number of positives in the test set, the true positives, the false positives, precision, recall, and F-measure.

The result for the MeSH headings Humans, Male and Female should be similar to these results:

Female | 4616 | 3190 | 907 | 0.7786185013424457 | 0.6910745233968805 | 0.7322391828302537
Humans | 7688 | 7081 | 775 | 0.9013492871690427 | 0.9210457856399584 | 0.9110910962429233
Male | 4396 | 2994 | 968 | 0.7556789500252398 | 0.681073703366697 | 0.7164393395549175
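As a sanity check, the three scores can be recomputed from the counts on each line. In the sample rows, the reported precision equals true positives / (true positives + fourth field) and the reported recall equals true positives / positives, so this sketch treats the fourth field as the false-positive count. The function name recompute_metrics is hypothetical, and the field separator pattern accepts pipes with or without surrounding spaces.

```shell
# Hypothetical helper: read benchmark lines from standard input and print
# heading|precision|recall|F-measure, recomputed from the counts.
# Field 2: positives in the test set; field 3: true positives;
# field 4: treated as false positives, since the sample rows satisfy
# precision = field 3 / (field 3 + field 4).
recompute_metrics() {
    awk -F' *[|] *' '{
        pos = $2 + 0; tp = $3 + 0; fp = $4 + 0
        p = (tp + fp > 0) ? tp / (tp + fp) : 0
        r = (pos > 0) ? tp / pos : 0
        f = (p + r > 0) ? 2 * p * r / (p + r) : 0
        printf "%s|%.4f|%.4f|%.4f\n", $1, p, r, f
    }'
}

printf '%s\n' 'Female | 4616 | 3190 | 907' | recompute_metrics
# prints: Female|0.7786|0.6911|0.7322
```

The printed values agree with the Female row above when rounded to four decimal places.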

Now that you have managed to train and evaluate classifiers based on the example data set, you can learn more about this tool from the document available here.
