The MTI ML package provides machine learning algorithms optimized for large text categorization tasks and can combine several text categorization solutions. Its advantages over existing approaches are: 1) its speed, 2) its ability to work with a large number of categorization problems, and 3) its ability to compare several text categorization tools based on meta-learning. This website describes how to download, install, and run MTI ML. An example data set is provided to verify the installation of the tool. More detailed instructions on using the tool are available here.

The latest changes in the MTI ML package can be found in the release notes here.

Download

The main components of the MTI ML distribution are easily downloaded via the single MTI_ML.tar.gz link below. MTI ML also requires a third party package monq-1.1.1.jar which is available via the link below.

The monq-1.1.1.jar package is a third party open source development resource that was originally available from BerliOS. MTI ML uses it to parse XML and to provide server capabilities. We have preserved a static copy of the monq-1.1.1.jar file locally.

For best results, download and install the MTI ML distribution file MTI_ML.tar.gz and then download and install the monq-1.1.1.jar package in the new MTI_ML directory that was created by the MTI ML distribution file.


JAR files included in the MTI_ML distribution:

Installation

MTI ML has been compiled and verified to work with Java 1.6. If you need to install Java or update your current version, follow this link: http://www.java.com. Be sure that the path to the java program is in the PATH environment variable.

Move the downloaded files into the directory where you want to install MTI ML. Uncompressing and untarring the MTI ML distribution file creates a subdirectory called "MTI_ML"; that directory is referred to as <parent_directory> throughout the rest of the instructions. For example, if you create a directory "Project" and install MTI ML in Project, then <parent_directory> is <path to Project>/Project/MTI_ML, or <path to Project>\Project\MTI_ML under Windows.

# Windows:

WinRAR or WinZip can be used to uncompress and untar the files.

# Linux and Mac OS (from Bourne/Bash Shell):

gunzip -c MTI_ML.tar.gz | tar -xf -

NOTE: Now, move the downloaded monq-1.1.1.jar file into the newly created MTI_ML directory before proceeding to Training.

cd MTI_ML

# Windows:

move ..\monq-1.1.1.jar .

# Linux and Mac OS (from Bourne/Bash Shell):

mv ../monq-1.1.1.jar .

The various MTI ML jar files have to be added to the CLASSPATH environment variable or configured directly using the -cp parameter of the Java Virtual Machine (JVM).
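
As a minimal sketch of the -cp alternative on Linux or Mac OS, the classpath string can be assembled from the installation directory and passed directly to java. The helper name mti_classpath and the path /home/user/Project/MTI_ML are hypothetical stand-ins; the three jar names are the ones used in the CLASSPATH settings in this tutorial. On Windows the separator is ";" rather than ":".

```shell
#!/bin/sh
# Build the colon-separated classpath for the three jars named in this
# tutorial. mti_classpath is a hypothetical helper, not part of MTI ML.
mti_classpath() {
    dir=$1
    printf '%s/monq-1.1.1.jar:%s/utils.jar:%s/mti_prod.jar\n' "$dir" "$dir" "$dir"
}

# /home/user/Project/MTI_ML stands in for <parent_directory>.
CP=$(mti_classpath /home/user/Project/MTI_ML)

# Print (rather than run) a command that uses -cp instead of CLASSPATH.
echo "java -cp $CP gov.nih.nlm.nls.mti.evaluator.Evaluator benchmark.test"
```

With -cp, nothing needs to be exported, so the setting applies only to that one java invocation.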

Training and testing classifiers using MTI ML

The following section provides a tutorial on how to use MTI ML. All of the tutorial examples are provided for the three supported operating systems: Windows (XP and 7), Linux, and Mac OS. Before you start, please download the tutorial environment by saving all of the following files to your work area. With these instructions and the tutorial environment, you should be able to recreate the entire MTI ML sample run, from training the machine learning algorithms to evaluating the results.

The tutorial environment consists of sample training and test data sets, a sample configuration file, and a benchmark file with the gold standard annotations for the data sets to evaluate the results.

Please Note: The MEDLINE citations in the training and test data sets and the results in the benchmark file represent a static view of the MEDLINE database at the time the data was created. No attempt has been made to keep the data up-to-date.

Training, testing, and evaluation files included in the MTI_ML distribution:

Using MTI ML on Female, Humans, and Male MeSH Headings


Note: In Windows, the files citations.train.xml.gz and citations.test.xml.gz need to be uncompressed before proceeding. WinRAR or WinZip can be used to uncompress the files.

Note: Please substitute the full path of the installation directory wherever <parent_directory> is used in the instructions.

Move to the Parent Directory

# Windows. Open a Windows command prompt. Go to <parent_directory>.

Move to the drive (e.g. C:) where the <parent_directory> is located

[Drive]:

cd <parent_directory>

# Linux and Mac OS. Open a terminal. Go to <parent_directory>.

cd <parent_directory>

Setting the CLASSPATH environment variable

# Windows:

set CLASSPATH="<parent_directory>\monq-1.1.1.jar;<parent_directory>\utils.jar;<parent_directory>\mti_prod.jar"

# Linux and Mac OS:

* in C Shell (csh or tcsh)

setenv CLASSPATH <parent_directory>/monq-1.1.1.jar:<parent_directory>/utils.jar:<parent_directory>/mti_prod.jar

* in Bourne Again Shell (bash)

export CLASSPATH=<parent_directory>/monq-1.1.1.jar:<parent_directory>/utils.jar:<parent_directory>/mti_prod.jar

* in Bourne Shell (sh)

CLASSPATH=<parent_directory>/monq-1.1.1.jar:<parent_directory>/utils.jar:<parent_directory>/mti_prod.jar
export CLASSPATH
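
Once CLASSPATH is set, a quick check that every entry actually exists can catch a misplaced jar (for example, monq-1.1.1.jar left outside the MTI_ML directory) before training starts. The following is a sketch for Bourne-compatible shells; missing_classpath_entries is a hypothetical helper, not part of MTI ML.

```shell
#!/bin/sh
# Print every entry of a colon-separated classpath that is not an
# existing file. Hypothetical helper, not part of MTI ML.
missing_classpath_entries() {
    # Split the list into one entry per line, then test each entry.
    printf '%s\n' "$1" | tr ':' '\n' | while IFS= read -r entry; do
        if [ -n "$entry" ] && [ ! -f "$entry" ]; then
            printf '%s\n' "$entry"
        fi
    done
}

missing=$(missing_classpath_entries "${CLASSPATH:-}")
if [ -n "$missing" ]; then
    echo "Missing jar files:"
    echo "$missing"
else
    echo "All CLASSPATH entries found (or CLASSPATH is empty)."
fi
```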

Training

The training tool generates a dictionary file stored in a trie structure (trie.gz) and a set of models (classifiers.gz) based on the definitions in the configuration.txt file.

The file configuration.txt contains the details of the training. In this case, we are training models for the Humans, Male, and Female MeSH headings.

Details of the training are sent to the standard output. In the example, this is redirected to out.txt and can be used to follow the training progress and understand the generated model.

The training will take several minutes. Check the file out.log to ensure that there were no errors during training.

# Windows:

type citations.train.xml | java -cp %CLASSPATH% -Xmx1G -Xms1G -ss6000k gov.nih.nlm.nls.mti.trainer.OVATrainer gov.nih.nlm.nls.mti.textprocessors.MEDLINEXMLTextProcessor "" gov.nih.nlm.nls.mti.featuresextractors.BinaryFeatureExtractor "-l -n -c -f1" configuration.txt trie.gz classifiers.gz 2> out.log > out.txt

# Linux and Mac OS (from Bourne/Bash Shell):

sh

gunzip -c citations.train.xml.gz | java -cp $CLASSPATH -Xmx1G -Xms1G -ss6000k gov.nih.nlm.nls.mti.trainer.OVATrainer gov.nih.nlm.nls.mti.textprocessors.MEDLINEXMLTextProcessor "" gov.nih.nlm.nls.mti.featuresextractors.BinaryFeatureExtractor "-l -n -c -f1" configuration.txt trie.gz classifiers.gz 2> out.log > out.txt
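
The log check mentioned above can be sketched as a small search for error-like lines. The patterns "exception" and "error" are assumptions based on how Java programs commonly report failures; the exact messages that MTI ML writes to out.log are not documented here, so a clean result is a hint rather than a guarantee. log_has_errors is a hypothetical helper.

```shell
#!/bin/sh
# Succeed (exit status 0) when the given log file contains lines that
# look like Java failures. The patterns are assumptions, not a
# documented MTI ML log format. Hypothetical helper.
log_has_errors() {
    [ -f "$1" ] && grep -i -E -q 'exception|error' "$1"
}

if log_has_errors out.log; then
    echo "Possible errors in out.log:"
    # Show the matching lines with their line numbers.
    grep -i -E -n 'exception|error' out.log
else
    echo "No error-like lines found in out.log (or the file is missing)."
fi
```

The same function can be pointed at annotation.log after the testing step.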


Testing

Given the trained model, we can then annotate a new set of citations. In the examples below, the outcome is stored in the annotation.txt file.

Annotation will take several minutes. Check the file annotation.log to ensure that there were no errors during annotation.

# Windows:

type citations.test.xml | java -ss6000k -cp %CLASSPATH% gov.nih.nlm.nls.mti.annotator.OVAAnnotator gov.nih.nlm.nls.mti.textprocessors.MEDLINEXMLTextProcessor "" gov.nih.nlm.nls.mti.featuresextractors.BinaryFeatureExtractor "-l -n -c" trie.gz classifiers.gz > annotation.txt 2> annotation.log

# Linux and Mac OS (from Bourne/Bash Shell):

gunzip -c citations.test.xml.gz | java -ss6000k -cp $CLASSPATH gov.nih.nlm.nls.mti.annotator.OVAAnnotator gov.nih.nlm.nls.mti.textprocessors.MEDLINEXMLTextProcessor "" gov.nih.nlm.nls.mti.featuresextractors.BinaryFeatureExtractor "-l -n -c" trie.gz classifiers.gz > annotation.txt 2> annotation.log
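
Before moving on to evaluation, it can help to confirm that the training and testing steps each left a non-empty output file. The sketch below checks the three file names used in the commands above (trie.gz, classifiers.gz, annotation.txt); empty_or_missing is a hypothetical helper, and a non-empty file does not by itself prove that the step succeeded.

```shell
#!/bin/sh
# Print each named file that is missing or has zero length.
# Hypothetical helper, not part of MTI ML.
empty_or_missing() {
    for f in "$@"; do
        # -s is true only for files that exist and have size > 0.
        if [ ! -s "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# File names as used in the training and testing commands.
problems=$(empty_or_missing trie.gz classifiers.gz annotation.txt)
if [ -n "$problems" ]; then
    echo "Missing or empty output files:"
    echo "$problems"
else
    echo "All expected output files are present."
fi
```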


Evaluation

The annotation can be evaluated using the following script. The file benchmark.test contains the MEDLINE MeSH indexing for each one of the citations in the test set and it is used as the Gold Standard. The output is stored in the file benchmark.txt.

# Windows:

type annotation.txt | java -cp %CLASSPATH% gov.nih.nlm.nls.mti.evaluator.Evaluator benchmark.test > benchmark.txt

# Linux and Mac OS (from Bourne/Bash Shell):

cat annotation.txt | java -cp $CLASSPATH gov.nih.nlm.nls.mti.evaluator.Evaluator benchmark.test > benchmark.txt

grep "^Female|" benchmark.txt
grep "^Humans|" benchmark.txt
grep "^Male|" benchmark.txt

The evaluation file benchmark.txt contains results for all the MeSH headings. Each line shows the result for a single MeSH heading, with fields separated by the pipe symbol. The fields are, in order: the MeSH heading name, the number of positives in the test set, the true positives, the false positives, precision, recall, and F-measure. Precision is the true positives divided by the sum of true and false positives; recall is the true positives divided by the positives in the test set.

The result for the MeSH headings Humans, Male and Female should be similar to these results:

Female|4616|3190|907|0.7786185013424457|0.6910745233968805|0.7322391828302537
Humans|7688|7081|775|0.9013492871690427|0.9210457856399584|0.9110910962429233
Male|4396|2994|968|0.7556789500252398|0.681073703366697|0.7164393395549175
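
As a cross-check, the precision, recall, and F-measure columns can be recomputed from the count columns. The sketch below assumes the field layout name|positives|true positives|false positives|..., which is inferred from the sample figures: they are consistent with precision = TP / (TP + fourth field) and recall = TP / positives. recompute_metrics is a hypothetical helper, not part of MTI ML.

```shell
#!/bin/sh
# Recompute precision, recall, and F-measure from the count fields of
# benchmark.txt-style lines read on standard input.
# Assumed layout (inferred from the sample figures):
#   name|positives|true positives|false positives|precision|recall|F
recompute_metrics() {
    awk -F'|' 'NF >= 4 && ($3 + $4) > 0 && $2 > 0 {
        p = $3 / ($3 + $4)        # precision = TP / (TP + FP)
        r = $3 / $2               # recall    = TP / positives
        f = (p + r > 0) ? 2 * p * r / (p + r) : 0
        printf "%s %.4f %.4f %.4f\n", $1, p, r, f
    }'
}

# The three sample result lines from this tutorial.
printf '%s\n' \
    'Female|4616|3190|907|0.7786185013424457|0.6910745233968805|0.7322391828302537' \
    'Humans|7688|7081|775|0.9013492871690427|0.9210457856399584|0.9110910962429233' \
    'Male|4396|2994|968|0.7556789500252398|0.681073703366697|0.7164393395549175' |
    recompute_metrics
# Output:
# Female 0.7786 0.6911 0.7322
# Humans 0.9013 0.9210 0.9111
# Male 0.7557 0.6811 0.7164
```

To check a real run, pipe the grep output into the function, for example: grep "^Female|" benchmark.txt | recompute_metrics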

Now that you have managed to train and evaluate classifiers based on the example data set, you can learn more about this tool from the document available here.



Last Modified: June 26, 2015   