Barrier Word Algorithm and Approximate Matching

Indexing Initiative
The barrier word method is a fast way of identifying short noun phrases in free text. The text is first split into sentences, where a sentence is taken to be a sequence of words beginning with a capital letter and ending with terminating punctuation. A candidate nominal phrase is then any sequence of words occurring between barrier words, which are drawn from a set of stopwords that includes articles, prepositions, and verbs. For example, consider the text: "The local anesthetic bupivacaine is cardiotoxic when accidentally injected into the circulation." A suitable set of barrier words would identify "local anesthetic bupivacaine", "cardiotoxic", and "circulation" as nominal phrases. While this method has been used for some time, a very long list of barrier words (approximately 24,000) was found to be much more effective at identifying nominal phrases in text than the traditional shorter lists.
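The basic procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the barrier list below is a tiny hand-picked stand-in for the roughly 24,000-entry list, and the function names `split_sentences` and `nominal_phrases`, the tokenizing regular expressions, and the case-insensitive comparison are hypothetical choices, not details of the actual implementation.

```python
import re

# Tiny illustrative barrier list (assumption); the real list has ~24,000 entries.
BARRIER_WORDS = {"the", "is", "when", "accidentally", "injected", "into"}


def split_sentences(text):
    """Split text into runs that start with a capital letter and end
    with terminating punctuation (a deliberately naive rule)."""
    return re.findall(r"[A-Z][^.!?]*[.!?]", text)


def nominal_phrases(sentence, barriers=BARRIER_WORDS):
    """Return the maximal runs of non-barrier words in one sentence."""
    words = re.findall(r"[A-Za-z0-9-]+", sentence)
    phrases, current = [], []
    for word in words:
        if word.lower() in barriers:
            # A barrier word closes the phrase collected so far.
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(word)
    if current:
        phrases.append(" ".join(current))
    return phrases


text = ("The local anesthetic bupivacaine is cardiotoxic when "
        "accidentally injected into the circulation.")
for sentence in split_sentences(text):
    print(nominal_phrases(sentence))
```

With this barrier list, the example sentence yields the three phrases named above. Quality depends almost entirely on the barrier list, which is consistent with the observation that a much longer list works better.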

Recent refinements and additions to the method include a two-step parsing procedure and a rule that allows certain words, which would otherwise be stopwords, to appear in the middle of a phrase. The two-step procedure first screens for chemical names, which might otherwise be broken up by the punctuation marks or other linguistic infelicities they contain. Chemical names occurring in the text are identified by a "sliding window technique", which requires a long list of known chemicals. Such a list was obtained from the UMLS. A string of characters from the text is matched, character by character, against this list. As long as the string continues to match the beginning of a chemical on the list, the process continues. It stops either when the character-by-character match can no longer continue or when a complete match with a chemical on the list occurs.
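One way to picture the sliding window screen is the sketch below. It is a minimal illustration, not the actual implementation: the two-entry chemical list stands in for the UMLS-derived list, the names `build_prefixes` and `find_chemicals` are hypothetical, and the word-boundary check and ASCII-only lowercasing are added assumptions to keep short names from matching inside longer words.

```python
def build_prefixes(names):
    """Collect every leading substring of every known chemical name."""
    return {name[:i] for name in names for i in range(1, len(name) + 1)}


def _is_boundary(text, i):
    """True if position i is outside the text or not a letter/digit."""
    return i < 0 or i >= len(text) or not text[i].isalnum()


def find_chemicals(text, chemicals):
    """Return (offset, name) pairs for listed chemicals found in text.

    Assumes ASCII text, so lowercasing does not change string length.
    """
    names = {c.lower() for c in chemicals}
    prefixes = build_prefixes(names)
    lowered = text.lower()
    found, start = [], 0
    while start < len(lowered):
        match_end = None
        end = start + 1
        # Extend the window while it is still a prefix of some chemical.
        while end <= len(lowered) and lowered[start:end] in prefixes:
            if (lowered[start:end] in names
                    and _is_boundary(lowered, start - 1)
                    and _is_boundary(lowered, end)):
                match_end = end  # complete match: stop extending
                break
            end += 1
        if match_end is not None:
            found.append((start, text[start:match_end]))
            start = match_end
        else:
            start += 1  # slide the window one character to the right
    return found


chemicals = ["bupivacaine", "1,2-dichloroethane"]  # stand-in list
print(find_chemicals(
    "Exposure to 1,2-dichloroethane and bupivacaine was measured.",
    chemicals))
```

Because the match runs over characters rather than tokens, a name such as 1,2-dichloroethane is captured whole even though it contains a comma and a hyphen. A production version would more likely store the list in a trie than in a set of prefixes.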

The second refinement treats articles and prepositions, which may occur in the middle of a nominal phrase, as ordinary words rather than barrier words when the word preceding them is not itself a barrier word. Thus "carcinoma of the pancreas" is identified as a nominal phrase despite the presence of "of" and "the", which would otherwise be barrier words. The phrase "consists of the pancreas" is correctly not identified as a nominal phrase because "consists", a verb, is also a barrier word.
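The rule for articles and prepositions can be sketched as follows. This is one possible reading of the rule, not the actual implementation. The word lists and the name `nominal_phrases_with_connectors` are hypothetical. Two further assumptions are made: an exempted word counts as a non-barrier for the word that follows it (so that "the" after "of" is also kept), and any articles or prepositions left dangling at the end of a phrase are trimmed.

```python
import re

# Illustrative lists only (assumptions); the real lists are far longer.
CONNECTORS = {"of", "the", "a", "an", "in"}  # articles and prepositions
BARRIER_WORDS = CONNECTORS | {"consists", "is", "was", "found"}


def nominal_phrases_with_connectors(sentence,
                                    barriers=BARRIER_WORDS,
                                    connectors=CONNECTORS):
    """Barrier word phrase extraction that keeps articles and
    prepositions when the preceding word is not a barrier word."""
    words = re.findall(r"[A-Za-z0-9-]+", sentence)
    phrases, current = [], []
    prev_is_barrier = True  # nothing precedes the first word

    def flush():
        # Drop connectors left dangling at the end of a phrase.
        while current and current[-1].lower() in connectors:
            current.pop()
        if current:
            phrases.append(" ".join(current))
        current.clear()

    for word in words:
        lower = word.lower()
        is_barrier = lower in barriers
        # Exempt an article or preposition that follows a non-barrier word.
        if is_barrier and lower in connectors and not prev_is_barrier:
            is_barrier = False
        if is_barrier:
            flush()
        else:
            current.append(word)
        prev_is_barrier = is_barrier
    flush()
    return phrases


print(nominal_phrases_with_connectors("Carcinoma of the pancreas was found."))
print(nominal_phrases_with_connectors("The tumor consists of the pancreas."))
```

In the second sentence "of" follows the barrier word "consists", so it stays a barrier and only "tumor" and "pancreas" are returned as separate phrases.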

The noun phrases identified using the barrier word method are then processed by the Approximate Matching algorithm, a version of MetaMap's browse mode created for use in NLM's Large Scale Vocabulary Test. Approximate matching is much less strict than normal (semantic) MetaMap processing. In effect, it casts a wider net to locate Metathesaurus concepts that would be missed by MetaMap's semantic mode. This produces more concepts at the risk of including ones not closely related to the input noun phrase.

Last Modified: May 30, 2019