Datasets - Indexing Initiative

INFORMATION & RESOURCES

Datasets & Test Collections

This page contains Datasets and Test Collections related to the various research projects and papers associated with the Indexing Initiative project. In some cases, these are the complete Test Collections used including MEDLINE citations and any manually annotated results. In other cases, we provide PMIDs for the MEDLINE citations used in the experiments and it is up to the user to download these either from Entrez/PubMed or our MEDLINE Baseline Repository web sites. The packaging depends on what permissions we have been granted for releasing the data.

Datasets (DS) : Contain the basic components required to replicate a set of experiments. This will at a minimum include lists of PMIDs and may include additional information.

Test Collections (TC) : Contain a full complement of data required to replicate a set of experiments. This will include PMID lists, MEDLINE citations in XML and/or ASCII MEDLINE format, annotated results, and detailed explanations of the results.

PLEASE NOTE

The records included in each version of the Datasets and Test Collections represent a static view of the data at the time each Dataset or Test Collection was created.

For example, the "Original 1999 Indexing and Format (January 20, 1999) Version" Test Collection represents a static view of PubMed/MEDLINE as of January 20, 1999. The text of these records has not been reformatted, and their MeSH Indexing has not been updated.

Test Collection used in experiments for paper: Finding medications doses in the literature, Dina Demner-Fushman, MD, PhD, James G. Mork, MS, Willie J. Rogers, Sonya E Shooshan, MLS, Laritza Rodriguez, MD, PhD, Alan R. Aronson, PhD. National Library of Medicine, National Institutes of Health, HHS, Bethesda, MD, USA.

Test Collection in Brat Standoff Format (300KB)

Dosage Info models with software and training features (1.8GB)

PMC Full Text Articles, subject terms, and experiment files used in the "Extracting Characteristics of the Study Subjects from Full-Text Articles" paper. The gzipped tar file contains the full set of PMC Full Text articles used in the experiments, both as a single file and as a subdirectory with each article self-contained in its own file. Each of the subject data files used in the experiments is also included. We have also included the list of 29 CheckTags and 51 Mice/Rat terms used in the experiments.

00README file describing Test Collection files (1.1 kb)

2015 Subject Extraction Test Collection - Gzipped Tar File (597 mb)

MeSH Descriptor and MeSH Descriptor/MeSH Qualifier Vocabulary Density Study Datasets used in the "Recent Enhancements to the NLM Medical Text Indexer" paper and "Vocabulary Density Method for Customized Indexing of MEDLINE Journals" AMIA poster. One file contains the MeSH Descriptor Vocabulary Density Study results and one file contains the MeSH Descriptor/MeSH Qualifier Vocabulary Density Study results. There is also a document detailing how the datasets were created and the format of the files. NEW: We just added an Excel spreadsheet summarizing the number of unique MeSH Headings found for each of the journals in the study.

2014 MeSH Descriptor Vocabulary Density Study Dataset - Gzipped Tar File (82 MB)

2014 MeSH Descriptor/MeSH Qualifier Vocabulary Density Study Dataset - Gzipped Tar File (121 MB)

2014 Vocabulary Density Study Datasets Details - PDF File (57 KB)

2014 Vocabulary Density Summary by Journal - Excel File (291 KB)

Dataset of Training and Testing randomly selected PMIDs as well as True Positives for both Training and Testing used in the "Identifying Publication Types Using Machine Learning" paper.

2013 BioASQ Publication Types Paper Dataset - Gzipped Tar File (25 MB)

NOTE: The True Positive files are simply the MeSH Publication Types assigned to the various MEDLINE citations at the time of the experiments. The file format is PMID|MeSH Publication Type. There is a True Positives file for both the Training and Testing sets of MEDLINE citations. NOTE: There may be multiple Publication Types assigned to the same MEDLINE citation.
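As a minimal sketch of reading the PMID|MeSH Publication Type format described above, the following Python groups the lines by PMID so that citations with several Publication Types keep all of them. The function name and the sample records are illustrative assumptions (the PMIDs are placeholders, not data from the actual files).

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def load_true_positives(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group 'PMID|label' lines by PMID, keeping every label per citation."""
    labels_by_pmid: Dict[str, List[str]] = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first pipe only, in case a label itself contains "|".
        pmid, _, label = line.partition("|")
        labels_by_pmid[pmid].append(label)
    return dict(labels_by_pmid)


# Made-up sample records; the PMIDs are placeholders, not real data.
sample = [
    "10000001|Journal Article",
    "10000001|Review",
    "10000002|Case Reports",
]
true_positives = load_true_positives(sample)
print(true_positives["10000001"])
```

With a downloaded file, the same function could be called on an open file handle instead of the sample list.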

Lists of PMIDs for the Datasets used in the "Mining MEDLINE for problems associated with vitamin D" paper.

2013 Vitamin D Dataset - Tar File (190 KB)

Dataset of Training and Testing randomly selected PMIDs as well as True Positives for both Training and Testing used in the "Comparison and combination of several MeSH indexing approaches" machine learning paper. Also, please see the MTI_ML web page for more information.

2013 MTI_ML Dataset - Tar File (12 MB)

NOTE: The True Positive files are simply the MeSH indexing assigned to the various MEDLINE citations at the time of the experiments. The file format is PMID|MeSH Descriptor. There is a True Positives file for both the Training and Testing sets of MEDLINE citations.

Dataset of Training and Testing randomly selected PMIDs used in the "MeSH indexing: machine learning and lessons learned" machine learning paper. Also, please see the MTI_ML web page for more information.

2012 MTI_ML Dataset - Tar File (1.6 MB)

Dataset of Training and Testing randomly selected PMIDs used in the "A One-Size-Fits-All Indexing Method Does Not Exist: Automatic Selection Based on Meta-Learning." and "Automatic algorithm selection for MeSH Heading indexing based on meta-learning." machine learning papers. Also, please see the MTI_ML web page for more information.

2011 MTI_ML Dataset - Tar File (900 KB)

Test Collection used in our Gene Indexing Assistant (GIA) project. The GIA corpus consists of 151 manually annotated MEDLINE citations, randomly extracted from journals on human genetics with publication dates between 2002 and 2011. Sentences in each abstract were detected and tokenized using MetaMap.

Please Note: The use of the GIA Test Collection is covered by the MetaMap Terms and Conditions; please review that document prior to downloading the GIA Test Collection.

All sentences were processed by our Gene Mention Identification module to tag gene mentions, and then corrected manually by a single annotator. Explicit mentions of individual genes or gene products are normalized to the relevant Entrez Gene ID. In cases where an individual gene is indicated, but the annotator was unable to determine which Entrez Gene ID was correct, the ID has been identified as "-1". For compound mentions, the extent of each mention is marked as the information required to identify the gene. For example, for BRCA1/2, two gene mentions would be delineated as "BRCA1" and "BRCA1/2." Proteins that refer to multiple genes, or mentions of protein families, are not annotated.

The original sentence-by-sentence annotated file has been converted into a format consistent with the Biocreative format. The format used is as follows:

pmid|t|entire title here
pmid|a|entire abstract here
pmid<tab>offset start<tab>offset start+strlen<tab>1st gene mention<tab>Gene<tab>Entrez Gene ID
pmid<tab>offset start<tab>offset start+strlen<tab>2nd gene mention<tab>Gene<tab>Entrez Gene ID
pmid<tab>offset start<tab>offset start+strlen<tab>3rd gene mention<tab>Gene<tab>Entrez Gene ID
...
<blank line>
Second article
<blank line>
Third article
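
The layout above could be read with a small parser along the following lines. This is a sketch under stated assumptions: the annotation fields are taken to be tab-separated, in the order PMID, start offset, end offset, mention text, the literal "Gene", and Entrez Gene ID. The class and function names are hypothetical, and the sample records are made up for illustration. How offsets are counted across the title and abstract is not specified here, so the sample only annotates the title.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional


@dataclass
class GeneMention:
    start: int
    end: int
    text: str
    entrez_gene_id: str  # kept as a string; "-1" marks an undetermined ID


@dataclass
class Article:
    pmid: str
    title: str = ""
    abstract: str = ""
    mentions: List[GeneMention] = field(default_factory=list)


def parse_articles(lines: Iterable[str]) -> Iterator[Article]:
    """Yield one Article per blank-line-separated record."""
    current: Optional[Article] = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current article, if any.
            if current is not None:
                yield current
                current = None
            continue
        parts = line.split("|", 2)
        if len(parts) == 3 and parts[1] in ("t", "a"):
            pmid, kind, text = parts
            if current is None:
                current = Article(pmid=pmid)
            if kind == "t":
                current.title = text
            else:
                current.abstract = text
            continue
        # Assumption: annotation lines are tab-separated with six fields.
        fields = line.split("\t")
        if current is not None and len(fields) >= 6:
            current.mentions.append(
                GeneMention(int(fields[1]), int(fields[2]), fields[3], fields[5])
            )
    if current is not None:
        yield current  # the last record may not end with a blank line


# Made-up sample input (placeholder PMIDs and text, not from the collection).
sample = [
    "12345678|t|BRCA1 and BRCA2 in breast cancer",
    "12345678|a|A made-up abstract.",
    "12345678\t0\t5\tBRCA1\tGene\t672",
    "12345678\t10\t15\tBRCA2\tGene\t675",
    "",
    "23456789|t|Second made-up title",
    "23456789|a|Second made-up abstract.",
]
articles = list(parse_articles(sample))
for article in articles:
    print(article.pmid, [(m.text, m.entrez_gene_id) for m in article.mentions])
```

Keeping the Entrez Gene ID as a string avoids special handling for the "-1" value used when the annotator could not determine the correct ID.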

Biocreative Formatted GIA Test Collection (271 kb)

Please Note: The original 1998 WSD Test Collection is now considered out of date. We strongly recommend using the MSH WSD Dataset created by Antonio Jimeno-Yepes and Bridget McInnes, which is more current and contains a much larger and richer set of ambiguities.

The test collection consists of 50 highly frequent ambiguous UMLS concepts from 1998 MEDLINE. Each of the 50 ambiguous cases has 100 ambiguous instances randomly selected from the 1998 MEDLINE citations, for a total of 5,000 instances. We had a total of 11 evaluators, of which 8 completed 100% of the 5,000 instances, 1 completed 56%, 1 completed 44%, and the final evaluator completed 12% of the instances. Evaluations were only used when the evaluators completed all 100 instances for a given ambiguity. The following paper describes in more detail the development of the test collection:

Developing a Test Collection for Biomedical Word Sense Disambiguation, AMIA 2001 (93 KB)

Access to the WSD Test Collection requires a UMLS KS login.

Test Collection used in our Full Text experiments to date, and reported on in the 2005 AMIA paper, "Semi-Automatic Indexing of Full Text Biomedical Articles, AMIA 2005" (PDF: 100kb). For more detailed information on Indexing Initiative work involving Full Text, please see the section "Full Text Processing."

Original PubMed Central XML Format (October 22, 2003) Version (8 MB)

This is a tar file containing XML DTD files and a subdirectory "xml" which contains all 500 of the articles in separate *.pxml files.
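
One way to sanity-check the download is to count the *.pxml members of the archive. The sketch below uses Python's standard tarfile module; the helper names are hypothetical, and the tiny in-memory archive only mimics the described layout so that the example runs without the real file (the exact location of the DTD files inside the tar is an assumption).

```python
import io
import tarfile
from typing import List


def list_pxml_members(archive: tarfile.TarFile) -> List[str]:
    """Return the names of all regular *.pxml files in the archive, sorted."""
    return sorted(
        member.name
        for member in archive.getmembers()
        if member.isfile() and member.name.endswith(".pxml")
    )


def build_demo_archive() -> bytes:
    """Build a tiny in-memory tar that mimics the described layout (demo only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name in ("article.dtd", "xml/a.pxml", "xml/b.pxml"):
            payload = b"<placeholder/>"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


# Open the demo archive from memory and list its article files.
with tarfile.open(fileobj=io.BytesIO(build_demo_archive()), mode="r") as demo:
    pxml_names = list_pxml_members(demo)

print(len(pxml_names), pxml_names)
```

For the real collection, opening the downloaded tar by its path with tarfile.open and calling the same helper should report 500 names if the archive is complete.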

Original PubMed Central ASCII MEDLINE Format (March 22, 2005) Version (1.4 mb)

This is a text file containing abstracts in ASCII MEDLINE format from PubMed Entrez for 498 of the 500 articles. This file contains the MeSH Indexing used for comparison purposes in the above-mentioned paper. Two of the articles in the test collection have PMIDs of "0" and do not have indexing in this file.

Pseudo-ASCII MEDLINE Formatted (December 12, 2003) Version (15.3 mb)

This is a single file containing all 500 articles put into a pseudo-ASCII MEDLINE format which is required for MTI.

Pseudo-ASCII MEDLINE Formatted Label Break-out (February 6, 2004) Version (15.4 mb)

This is a single file containing all 500 articles put into a pseudo-ASCII MEDLINE format which is required for MTI. This file differs from the above in that the "important" sections (which might have separate sub-sections), for example "Background", "Methods", "Results", "Discussion", and "Conclusions", have been separated out of the article, and a new "citation" associated with the article PMID and section label has been created for each. The base abstract and title are listed separately and first for each article.

Example (PMID 11884248): PMC file shows <abs><sec><st> Abstract</st> Background</st> is translated to "AB - Abstract | Background | " in the pseudo-ASCII MEDLINE Break-out version as the main section "Abstract" contains a sub-section "Background".

Example II (PMID 11884248): PMC file shows </abs></fm><bdy><sec><st> Background</st> is translated to "PMID- 11884248_Background" in the pseudo-ASCII MEDLINE Break-out version as a new section in the article is identified as "Background", so we create a new "citation" using the same PMID and the new section name as the identifier.
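
When processing the Break-out version, section "citations" can be tied back to their source article by splitting the identifier at the first underscore. The sketch below illustrates this; the function names are hypothetical, the "Methods" identifier is a made-up example, and it is an assumption that the base citation is identified by the plain PMID with no suffix.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def split_breakout_pmid(identifier: str) -> Tuple[str, Optional[str]]:
    """Split '11884248_Background' into ('11884248', 'Background').

    An identifier without an underscore is treated as the base citation
    and returns None for the section label.
    """
    pmid, separator, label = identifier.partition("_")
    return pmid, (label if separator else None)


def group_by_article(identifiers: List[str]) -> Dict[str, List[Optional[str]]]:
    """Collect the section labels seen for each base PMID, in input order."""
    grouped: Dict[str, List[Optional[str]]] = defaultdict(list)
    for identifier in identifiers:
        pmid, label = split_breakout_pmid(identifier)
        grouped[pmid].append(label)
    return dict(grouped)


# "11884248_Background" comes from the example above; "_Methods" is made up.
identifiers = ["11884248", "11884248_Background", "11884248_Methods"]
sections = group_by_article(identifiers)
print(sections)
```

Splitting only at the first underscore keeps any later underscores as part of the section label.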

Test Collection used in our original experiments, tuning parameters phase, and now used to track improvements to MTI. We have included the original test collection from January 20, 1999 to allow comparison based on the actual data from that time. We have also included an updated version of the Test Collection to allow for more current comparison studies involving the use of PMIDs and current MeSH Indexing.

Original 1999 Indexing and Format (January 20, 1999) Version (473 kb)

Indexing and Format (March 14, 2007) Version (544 kb)