Indexing Initiative: Datasets & Test Collections
This page contains Datasets and Test Collections related to the various research projects and papers associated with the Indexing Initiative project. In some cases, these are the complete Test Collections used, including MEDLINE citations and any manually annotated results. In other cases, we provide only the PMIDs for the MEDLINE citations used in the experiments, and users must download the citations themselves from either Entrez/PubMed or our MEDLINE Baseline Repository website. The packaging depends on what permissions we have been granted for releasing the data.

Datasets (DS): Contain the basic components required to replicate a set of experiments. At a minimum, this includes lists of PMIDs and may include additional information.

Test Collections (TC): Contain a full complement of data required to replicate a set of experiments. This includes PMID lists, MEDLINE citations in XML and/or ASCII MEDLINE format, annotated results, and detailed explanations of the results.

Name | Type | Description | Date(s) Added
2015 Subject Extraction Test Collection | Test Collection | PMC Full Text Articles, subject terms, and experiment files used in the "Extracting Characteristics of the Study Subjects from Full-Text Articles" paper. | 13 Nov 2015
2014 Vocabulary Density Study Datasets | Dataset | MeSH Descriptor and MeSH Descriptor/MeSH Qualifier Vocabulary Density Study Datasets used in the "Recent Enhancements to the NLM Medical Text Indexer" paper and "Vocabulary Density Method for Customized Indexing of MEDLINE Journals" AMIA poster. | 25 Sep 2014
2013 BioASQ Publication Types Dataset | Dataset | Dataset of Training and Testing randomly selected PMIDs as well as True Positives for both Training and Testing used in the "Identifying Publication Types Using Machine Learning" paper. | 8 Sep 2013
2013 Vitamin D Dataset | Dataset | Lists of PMIDs for the Datasets used in the "Mining MEDLINE for problems associated with vitamin D" paper. | 14 Aug 2013
2013 MTI_ML Dataset | Dataset | Dataset of Training and Testing randomly selected PMIDs as well as True Positives for both Training and Testing used in the "Comparison and combination of several MeSH indexing approaches" machine learning paper. | 29 Jul 2013
2012 MTI_ML Dataset | Dataset | Dataset of Training and Testing randomly selected PMIDs used in the "MeSH indexing: machine learning and lessons learned" machine learning paper. | 29 Jul 2013
2011 MTI_ML Dataset | Dataset | Dataset of Training and Testing randomly selected PMIDs used in the "A One-Size-Fits-All Indexing Method Does Not Exist: Automatic Selection Based on Meta-Learning." and "Automatic algorithm selection for MeSH Heading indexing based on meta-learning." machine learning papers. | 29 Jul 2013
151 Citation GIA Test Collection | Test Collection | Test Collection used in our Gene Indexing Assistant (GIA) project. The GIA corpus consists of 151 manually annotated MEDLINE citations, randomly extracted from journals on human genetics with publication dates between 2002 and 2011. | 2012
Word Sense Disambiguation (WSD) Test Collection | Test Collection | The test collection consists of 50 highly frequent ambiguous UMLS concepts from 1998 MEDLINE with manually annotated results. | 2001
500 PubMed Central Full Text Test Collection | Test Collection | Test Collection used in our Full Text experiments to date. | 22 Oct 2003; 12 Dec 2003; 6 Feb 2004; 22 Mar 2005
200 MEDLINE Citations Test Collection | Test Collection | Test Collection used in our original experiments, tuning parameters phase, and now used to track improvements to MTI. | 20 Jan 1999; 14 Mar 2007


Details:

  • 2015 Subject Extraction Test Collection
    PMC Full Text Articles, subject terms, and experiment files used in the "Extracting Characteristics of the Study Subjects from Full-Text Articles" paper. The gzipped tar file contains the full set of PMC Full Text articles used in the experiments, both as a single file and as a subdirectory with each article self-contained in its own file. Each of the subject data files used in the experiments is also included. We have also included the list of 29 CheckTags and 51 Mice/Rat terms used in the experiments.




  • 2014 Vocabulary Density Study Datasets
    MeSH Descriptor and MeSH Descriptor/MeSH Qualifier Vocabulary Density Study Datasets used in the "Recent Enhancements to the NLM Medical Text Indexer" paper and "Vocabulary Density Method for Customized Indexing of MEDLINE Journals" AMIA poster. One file contains the MeSH Descriptor Vocabulary Density Study results and one file contains the MeSH Descriptor/MeSH Qualifier Vocabulary Density Study results. There is also a document detailing how the datasets were created and the format of the files. NEW: We just added an Excel spreadsheet summarizing the number of unique MeSH Headings found for each of the journals in the study.




  • 2013 BioASQ Publication Types Dataset
    Dataset of Training and Testing randomly selected PMIDs as well as True Positives for both Training and Testing used in the "Identifying Publication Types Using Machine Learning" paper.

    • 2013 BioASQ Publication Types Paper Dataset - Gzipped Tar File (25 MB)

    • NOTE: The True Positive files are simply the MeSH Publication Types assigned to the various MEDLINE citations at the time of the experiments. The file format is PMID|MeSH Publication Type. There is a True Positives file for both the Training and Testing sets of MEDLINE citations. NOTE: There may be multiple Publication Types assigned to the same MEDLINE citation.
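
    As a minimal sketch of reading a file in the PMID|MeSH Publication Type layout described in the note, the following groups Publication Types by PMID so that citations with several types are handled. The function name and the sample PMIDs are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
from collections import defaultdict


def load_true_positives(lines):
    """Group values by PMID from 'PMID|MeSH Publication Type' lines."""
    by_pmid = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first '|' only; the rest is the Publication Type.
        pmid, _, pub_type = line.partition("|")
        by_pmid[pmid].append(pub_type)
    return dict(by_pmid)


# Made-up sample rows (hypothetical PMIDs), one Publication Type per line.
sample = [
    "10000001|Journal Article",
    "10000001|Review",
    "10000002|Clinical Trial",
]
true_positives = load_true_positives(sample)
print(true_positives["10000001"])
```

    With a real file, the same function could be called on an open file handle, since it only iterates over lines.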



  • 2013 Vitamin D Dataset
    Lists of PMIDs for the Datasets used in the "Mining MEDLINE for problems associated with vitamin D" paper.




  • 2013 MTI_ML Dataset
    Dataset of Training and Testing randomly selected PMIDs as well as True Positives for both Training and Testing used in the "Comparison and combination of several MeSH indexing approaches" machine learning paper. Also, please see the MTI_ML web page for more information.

    • 2013 MTI_ML Dataset - Tar File (12 MB)

    • NOTE: The True Positive files are simply the MeSH indexing assigned to the various MEDLINE citations at the time of the experiments. The file format is PMID|MeSH Descriptor. There is a True Positives file for both the Training and Testing sets of MEDLINE citations.
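
    One plausible use of a PMID|MeSH Descriptor True Positives file is scoring a set of predicted descriptors against it. The sketch below computes micro-averaged precision, recall, and F1; this metric choice is an assumption made for illustration and is not stated on this page as the method used in the paper. The function name and sample data are hypothetical.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall, F1.

    gold, predicted: dict mapping PMID -> set of MeSH Descriptors.
    """
    tp = fp = fn = 0
    for pmid in set(gold) | set(predicted):
        g = gold.get(pmid, set())
        p = predicted.get(pmid, set())
        tp += len(g & p)   # descriptors both assigned and predicted
        fp += len(p - g)   # predicted but not in the true positives
        fn += len(g - p)   # in the true positives but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up example (hypothetical PMIDs and assignments).
gold = {"1": {"Humans", "Mice"}, "2": {"Vitamin D"}}
predicted = {"1": {"Humans"}, "2": {"Vitamin D", "Animals"}}
precision, recall, f1 = micro_prf(gold, predicted)
print(precision, recall, f1)
```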



  • 2012 MTI_ML Dataset
    Dataset of Training and Testing randomly selected PMIDs used in the "MeSH indexing: machine learning and lessons learned" machine learning paper. Also, please see the MTI_ML web page for more information.




  • 2011 MTI_ML Dataset
    Dataset of Training and Testing randomly selected PMIDs used in the "A One-Size-Fits-All Indexing Method Does Not Exist: Automatic Selection Based on Meta-Learning." and "Automatic algorithm selection for MeSH Heading indexing based on meta-learning." machine learning papers. Also, please see the MTI_ML web page for more information.




  • 151 Citation GIA Test Collection
    Test Collection used in our Gene Indexing Assistant (GIA) project. The GIA corpus consists of 151 manually annotated MEDLINE citations, randomly extracted from journals on human genetics with publication dates between 2002 and 2011. Sentences in each abstract were detected and tokenized using MetaMap.

    Please Note: Use of the GIA Test Collection is covered by the MetaMap Terms and Conditions. Please review that document prior to downloading the GIA Test Collection.

    All sentences were processed by our Gene Mention Identification module to tag gene mentions, and then corrected manually by a single annotator. Explicit mentions of individual genes or gene products are normalized to the relevant Entrez Gene ID. In cases where an individual gene is indicated, but the annotator was unable to determine which Entrez Gene ID was correct, the ID has been identified as "-1". For compound mentions, the extent of each mention is marked as the information required to identify the gene. For example, for BRCA1/2, two gene mentions would be delineated as "BRCA1" and "BRCA1/2." Proteins that refer to multiple genes, or mentions of protein families, are not annotated.

    The original sentence-by-sentence annotated file has been converted into a format consistent with the BioCreative format. The format used is as follows:
    pmid|t|entire title here
    pmid|a|entire abstract here
    pmid<tab>offset start<tab>offset start+strlen<tab>1st gene mention<tab>Gene<tab>Entrez Gene ID
    pmid<tab>offset start<tab>offset start+strlen<tab>2nd gene mention<tab>Gene<tab>Entrez Gene ID
    pmid<tab>offset start<tab>offset start+strlen<tab>3rd gene mention<tab>Gene<tab>Entrez Gene ID
    ...
    <blank line>
    Second article
    <blank line>
    Third article
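
    The layout above can be read with a short parser such as the sketch below. It assumes articles are separated by blank lines, that header lines use the literal tags "t" and "a", and that mention lines have exactly six tab-separated fields; the function name and the sample records are hypothetical. Gene IDs are kept as strings so that the "-1" marker for unresolved genes is preserved as written.

```python
import re


def parse_gia(text):
    """Parse 'pmid|t|...', 'pmid|a|...' and tab-separated mention lines."""
    articles = []
    for block in re.split(r"\n\s*\n", text.strip()):
        article = {"pmid": None, "title": "", "abstract": "", "mentions": []}
        for line in block.splitlines():
            if not line.strip():
                continue
            parts = line.split("|", 2)
            if len(parts) == 3 and parts[0].isdigit() and parts[1] in ("t", "a"):
                # Title or abstract header line.
                article["pmid"] = parts[0]
                key = "title" if parts[1] == "t" else "abstract"
                article[key] = parts[2]
            else:
                # Mention line: pmid, start, end, text, type, Entrez Gene ID.
                pmid, start, end, mention, kind, gene_id = line.split("\t")
                article["mentions"].append({
                    "start": int(start),
                    "end": int(end),
                    "text": mention,
                    "type": kind,
                    "gene_id": gene_id,
                })
        articles.append(article)
    return articles


# Made-up records for illustration only (not taken from the corpus).
sample = "\n".join([
    "1|t|BRCA1 and cancer",
    "1|a|A short abstract.",
    "\t".join(["1", "0", "5", "BRCA1", "Gene", "672"]),
    "",
    "2|t|Gene X study",
    "2|a|Another abstract.",
    "\t".join(["2", "0", "6", "Gene X", "Gene", "-1"]),
])
articles = parse_gia(sample)
print(len(articles), articles[0]["mentions"][0]["gene_id"])
```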
    



  • Word Sense Disambiguation (WSD) Test Collection
    Please Note: The original 1998 WSD Test Collection is considered out of date at this point. We strongly recommend using the more current version of WSD created by Antonio Jimeno-Yepes and Bridget McInnes - MSH WSD Dataset. The MSH WSD Dataset is more current and contains a much larger and richer set of ambiguities.

    The test collection consists of 50 highly frequent ambiguous UMLS concepts from 1998 MEDLINE. Each of the 50 ambiguous cases has 100 ambiguous instances randomly selected from the 1998 MEDLINE citations, for a total of 5,000 instances. We had a total of 11 evaluators, of whom 8 completed 100% of the 5,000 instances, 1 completed 56%, 1 completed 44%, and the final evaluator completed 12% of the instances. Evaluations were only used when the evaluators completed all 100 instances for a given ambiguity. The following paper describes in more detail the development of the test collection:

    Access to the WSD Test Collection requires a UMLS KS login.


  • 500 PubMed Central Full Text Test Collection
    Test Collection used in our Full Text experiments to date, and reported on in the 2005 AMIA paper, "Semi-Automatic Indexing of Full Text Biomedical Articles, AMIA 2005" (PDF: 100 KB). For more detailed information on Indexing Initiative work involving Full Text, please see the section "Full Text Processing."
    • Original PubMed Central XML Format (October 22, 2003) Version (8 MB)
      This is a tar file containing XML DTD files and a subdirectory "xml" which contains all 500 of the articles in separate *.pxml files.

    • Original PubMed Central ASCII MEDLINE Format (March 22, 2005) Version (1.4 MB)
      This is a text file containing abstracts in ASCII MEDLINE format from PubMed Entrez for 498 of the 500 articles. This file contains the MeSH Indexing used for comparison purposes in the above-mentioned paper. Two of the articles in the test collection have PMIDs of "0" and do not have indexing in this file.

    • Pseudo-ASCII MEDLINE Format (December 12, 2003) Version (15.3 MB)
      This is a single file containing all 500 articles put into a pseudo-ASCII MEDLINE format which is required for MTI.

    • Pseudo-ASCII MEDLINE Label Break-out Format (February 6, 2004) Version (15.4 MB)
      This is a single file containing all 500 articles put into a pseudo-ASCII MEDLINE format which is required for MTI. This file differs from the above in that the "important" sections (which might have separate sub-sections), for example "Background", "Methods", "Results", "Discussion", and "Conclusions", have been separated out of the article, and a new "citation" associated with the article PMID and the section label has been created for each. The base abstract and title are listed separately and first for each article.

      Example (PMID 11884248): PMC file shows <abs><sec><st> <p>Abstract</p></st> <p>Background</p></st>
      is translated to "AB - Abstract | Background | " in the pseudo-ASCII MEDLINE Break-out version as the main section "Abstract" contains a sub-section "Background".

      Example II (PMID 11884248): PMC file shows </abs></fm><bdy><sec><st> <p>Background</p></st>
      is translated to "PMID- 11884248_Background" in the pseudo-ASCII MEDLINE Break-out version as a new section in the article is identified as "Background", so we create a new "citation" using the same PMID and the new section name as the identifier.
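
      The two translations in these examples can be reproduced with a small sketch like the one below. The helper names are hypothetical, and the string patterns are inferred only from the two examples above, so they may not cover every case in the actual file.

```python
def section_citation_id(pmid, section):
    """Build the identifier line for a section-level 'citation'."""
    return f"PMID- {pmid}_{section}"


def abstract_label_prefix(labels):
    """Build the AB prefix from nested section labels, outermost first."""
    return "AB - " + "".join(f"{label} | " for label in labels)


# Values taken from the two examples for PMID 11884248.
section_id = section_citation_id("11884248", "Background")
ab_prefix = abstract_label_prefix(["Abstract", "Background"])
print(section_id)
print(ab_prefix)
```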



  • 200 MEDLINE Citations Test Collection
    Test Collection used in our original experiments, tuning parameters phase, and now used to track improvements to MTI. We have included the original test collection from January 20, 1999 to allow comparison based on the actual data from that time. We have also included an updated version of the Test Collection to allow for more current comparison studies involving the use of PMIDs and current MeSH Indexing.




Last Modified: November 13, 2015   