SKR Help Info


Quick Guide:

SUPPORTED FILE FORMATS:

The SKR/MetaMap system requires an ASCII input file in one of the formats listed below. For best results, we recommend the first format, "MEDLINE". It is the format the SKR/MetaMap program was originally built around and is still the best supported of all the formats. It is also better to combine many items into a single file and let the Scheduler handle the distribution for you. Submitting a large number of small files with few entries each forces the Scheduler to swap more and slows things down.
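As an illustration of combining many small inputs into one submission file, here is a minimal Python sketch. The function name merge_batches is hypothetical and not part of SKR/MetaMap; it assumes each input is text whose items are already separated by blank lines, as in formats 1 and 2 below.

```python
import re


def merge_batches(file_contents):
    """Merge several blank-line-separated inputs into one string.

    Items are kept in order and separated by exactly one blank line.
    """
    items = []
    for content in file_contents:
        normalized = content.replace("\r\n", "\n").strip()
        # One or more blank lines mark the boundary between two items.
        for item in re.split(r"\n\s*\n", normalized):
            item = item.strip()
            if item:
                items.append(item)
    return "\n\n".join(items) + "\n"


small_files = ["item 1 line 1\nitem 1 line 2\n\n\nitem 2\n", "item 3\n"]
print(merge_batches(small_files))
```

The merged string can then be written to a single file and submitted once, instead of submitting each small file separately.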

Note: The Scheduler does not support non-ASCII characters. If your file contains non-ASCII characters (for example, UTF-8 encoded Unicode text), it will likely cause an error.
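One way to check a file before submission is to list every non-ASCII character with its position. The following Python sketch is only an illustration. The helper name find_non_ascii is hypothetical, and the check runs entirely on your side, not in the Scheduler.

```python
def find_non_ascii(text):
    """Return (line_number, column, character) for each non-ASCII character."""
    hits = []
    # Split on "\n" only, so unusual Unicode line separators are reported too.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch))
    return hits


sample = "TI  - Effect of \u03b2-blockers\nAB  - Plain ASCII abstract."
for line_no, col, ch in find_non_ascii(sample):
    print(f"line {line_no}, column {col}: {ch!r}")
```

An empty result means the text is pure ASCII.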

Note: If you are going to send free format text, please break your text into smaller chunks before running it through the Scheduler. Large chunks of text take too long to process. As a rule of thumb, we typically break free format text into chunks of around 2,000 - 3,000 characters.
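The chunking rule of thumb above could be applied with a sketch like the following. The function chunk_text and its max_chars parameter are assumptions made for illustration. It packs whole words greedily, so chunks never split a word, and it collapses all whitespace (including line breaks) to single spaces.

```python
def chunk_text(text, max_chars=3000):
    """Split text into chunks of at most max_chars, breaking only at whitespace.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) > max_chars and current:
            # Adding this word would exceed the limit: close the chunk.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


long_text = "some free text to be processed " * 400
chunks = chunk_text(long_text)
# Join with blank lines so each chunk becomes one free format item.
payload = "\n\n".join(chunks) + "\n"
print(len(chunks), max(len(c) for c in chunks))
```

Splitting at sentence boundaries instead of arbitrary word boundaries may give better results, but that needs a sentence splitter and is not shown here.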

  1. MEDLINE format with a blank line separating each item to be processed.
    Use of "PMID-" as an identifier tag is supported by all applications.

    Format Sample
    Columns:  12345678901234567890
              PMID- #########
              TI  - Some Title
                    Title line 2 & subsequent lines (if necessary).
              AB  - Abstract of item
                    Abstract line 2 & subsequent lines (if necessary).
    
              Alternatively,
    
              PMID- #########
              TI  - Some Title all one string.
              AB  - Abstract of item all one string extending over multiple lines 
    when necessary and as long as you need it to be.  This is sometimes easier 
    because you don't have to reformat your input as much.
    MEDLINE Sample File


  2. Free format with a blank line separating each item to be processed.

    Format Sample
    item 1 text to be processed free text
    item 1 line 2 of free text to be processed
    
    item 2 first line to be processed.
    Free Text Sample File


  3. Single Line Delimited Input
    NOTE: You MUST select "Single Line Delimited Input" from the list of "Scheduler Specific Options" on the various submission pages for this to work.

    Format Sample
    item 1 text to be processed free text
    item 2 text to be processed.
    Single Line Delimited Input Sample File


  4. Single Line Delimited Input w/ ID
    NOTE: You MUST select "Single Line Delimited Input w/ ID" from the list of "Scheduler Specific Options" on the various submission pages for this to work. This option assumes a two field input line: "ID|text to be processed". The ID can be a combination of any alpha-numeric characters and the underscore character ("_"). For example, "001_title" or "00001".

    Format Sample
    0000001|item 1 text to be processed free text with ID
    0000002|item 2 text to be processed.
    Single Line Delimited w/ ID Input Sample File
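As a rough sketch of how input in formats 1 and 4 might be generated, the Python below builds a MEDLINE-style record (using the "all one string" variant) and an "ID|text" line. The names medline_record and id_line are hypothetical. The ID check follows the rule stated for format 4, and collapsing each text field onto one line is an assumption made here for simplicity.

```python
import re

# IDs for format 4: alphanumeric characters and underscore only.
ID_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def one_line(text):
    """Collapse all runs of whitespace, including line breaks, to one space."""
    return " ".join(text.split())


def medline_record(pmid, title, abstract):
    """Build one MEDLINE-format item with PMID, TI and AB fields."""
    return f"PMID- {pmid}\nTI  - {one_line(title)}\nAB  - {one_line(abstract)}"


def id_line(item_id, text):
    """Build one 'ID|text to be processed' line for format 4."""
    if not ID_PATTERN.fullmatch(item_id):
        raise ValueError(f"invalid ID: {item_id!r}")
    return f"{item_id}|{one_line(text)}"


records = [
    medline_record("12345678", "Some Title", "Abstract of item"),
    medline_record("23456789", "Another Title", "Another abstract"),
]
# MEDLINE items are separated by a blank line.
print("\n\n".join(records))
print(id_line("0000001", "item 1 text to be processed free text with ID"))
```

The PMIDs above are placeholders, not real citations.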




NOTES:
  1. You are only allowed to submit batch jobs as "Normal" priority.

  2. We currently support the 1999, 2006AA, 2009AA, 2009AB, 2010AA, and 2010AB UMLS Knowledge Sources. Any of these can be selected in both interactive and batch mode via the "Knowledge Source Options" pull-down menu.

  3. If you see one of your jobs developing a large number of errors, please suspend the job and work out what went wrong offline. This frees up the scheduling queue for other jobs.

  4. The tagger/parser only supports ASCII files with blank lines separating the phrases to be parsed.

  5. None of the current programs available within the Scheduler supports UTF-8! The input files must be converted to ASCII before submission.
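A possible way to do the ASCII conversion mentioned in note 5 is sketched below in Python. This is a lossy, illustrative approach and not an official NLM tool. The function name to_ascii, the small punctuation table, and the "?" placeholder are all assumptions. Accented letters lose their accents, and characters with no ASCII equivalent are replaced by the placeholder.

```python
import unicodedata

# A few common typographic characters mapped to ASCII by hand (assumed list).
PUNCTUATION = {
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-",
}


def to_ascii(text, placeholder="?"):
    """Return a lossy ASCII-only version of text."""
    out = []
    for ch in text:
        if unicodedata.combining(ch):
            # Drop stand-alone combining accents.
            continue
        ch = PUNCTUATION.get(ch, ch)
        # NFKD splits e.g. an accented letter into base letter plus accent.
        decomposed = unicodedata.normalize("NFKD", ch)
        ascii_part = decomposed.encode("ascii", "ignore").decode("ascii")
        out.append(ascii_part if ascii_part else placeholder)
    return "".join(out)


print(to_ascii("na\u00efve caf\u00e9 \u2013 \u201cquoted\u201d"))
```

Because Greek letters and other symbols become "?", you may prefer to spell them out (for example "beta") before converting, since that can affect the mapping results.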

Last Modified: October 21, 2014