Barrier Word Algorithm and Approximate Matching

The barrier word method is a fast way of identifying short noun phrases in free text. The text is first split into sentences, where a sentence is taken to be a sequence of words beginning with a capital letter and ending with terminating punctuation. A candidate nominal phrase is then any sequence of words occurring between barrier words, which are drawn from a set of stopwords that includes articles, prepositions, and verbs. For example, consider the text: "The local anesthetic bupivacaine is cardiotoxic when accidentally injected into the circulation." A suitable set of barrier words would identify "local anesthetic bupivacaine", "cardiotoxic", and "circulation" as nominal phrases. Although this method has been in use for some time, a very long list of barrier words (approximately 24,000) was found to be much more effective at identifying nominal phrases in text than the traditional shorter lists.
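The basic procedure can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation described here: the function names, the tokenizing regular expressions, and the six-word barrier list are all assumptions chosen so that the example sentence above works, whereas the real list has roughly 24,000 entries.

```python
import re

# Hypothetical miniature barrier list, chosen only to cover the example sentence.
BARRIER_WORDS = {"the", "is", "when", "accidentally", "injected", "into"}

def split_sentences(text):
    """Split after terminating punctuation when the next word starts with a capital."""
    return [s for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip()) if s]

def barrier_phrases(sentence, barrier_words=BARRIER_WORDS):
    """Return runs of consecutive non-barrier words as candidate nominal phrases."""
    phrases, current = [], []
    for token in re.findall(r"[A-Za-z0-9'-]+", sentence):
        if token.lower() in barrier_words:
            # A barrier word closes whatever phrase is currently open.
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:  # close a phrase that runs to the end of the sentence
        phrases.append(" ".join(current))
    return phrases

text = ("The local anesthetic bupivacaine is cardiotoxic when "
        "accidentally injected into the circulation.")
for sentence in split_sentences(text):
    print(barrier_phrases(sentence))
```

With this toy list the example sentence yields the three phrases named above. Results on other text depend entirely on how complete the barrier list is, which is consistent with the observation that the much longer list performs better.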

Recent refinements and additions to the method include a two-step parsing procedure and a rule that allows certain words, which would otherwise be stopwords, to appear in the middle of a phrase. The two-step procedure first screens for chemical names, which might otherwise be broken up by the punctuation marks or other linguistic infelicities they contain. Chemical names occurring in the text are identified by a "sliding window technique", which requires a long list of known chemicals. Such a list was obtained from the UMLS. A string of characters from the text is matched against this list character by character. As long as the string continues to match a chemical on the list, the process continues. It stops only when the character-by-character match can no longer continue or when a complete match with a chemical on the list occurs.
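One way to picture the sliding window is the sketch below. It is an assumption-laden illustration rather than the actual implementation: the names `build_prefixes` and `find_chemicals` are hypothetical, the two-item chemical list stands in for the list obtained from the UMLS, and no attention is paid to word boundaries. Following the description above, the window stops at the first complete match rather than searching for the longest one.

```python
def build_prefixes(chemicals):
    """Collect the known names and every prefix of them for fast partial-match tests."""
    names = {c.lower() for c in chemicals}
    prefixes = {name[:i] for name in names for i in range(1, len(name) + 1)}
    return names, prefixes

def find_chemicals(text, chemicals):
    """Return (start, end, matched_text) spans for known chemical names in text."""
    names, prefixes = build_prefixes(chemicals)
    lowered = text.lower()  # assumes lowercasing preserves length (true for ASCII)
    spans, start = [], 0
    while start < len(lowered):
        end, match = start + 1, None
        # Grow the window one character at a time while it still matches
        # the beginning of at least one known chemical name.
        while end <= len(lowered) and lowered[start:end] in prefixes:
            if lowered[start:end] in names:
                match = (start, end, text[start:end])
                break  # stop at the first complete match
            end += 1
        if match:
            spans.append(match)
            start = match[1]  # resume scanning after the matched name
        else:
            start += 1  # no name starts here; slide the window forward
    return spans

known = ["1,2-dichloroethane", "bupivacaine"]  # hypothetical stand-in list
print(find_chemicals("Patient received 1,2-dichloroethane and bupivacaine.", known))
```

Because the comma and hyphen inside "1,2-dichloroethane" are matched as ordinary characters, the name survives intact instead of being split at its punctuation, which is the point of running this step first.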

The second refinement removes an article or preposition from the barrier word list when the word preceding it is not itself a barrier word, so that such words can occur in the middle of a nominal phrase. Thus "carcinoma of the pancreas" would be identified as a nominal phrase despite the presence of the words "of" and "the", which would otherwise be barrier words. The phrase "consists of the pancreas" would correctly not be identified as a nominal phrase, because "consists", a verb, is also a barrier word.
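The rule can be sketched as a small change to the basic barrier word loop. This is a hedged illustration only: the word lists and the name `refined_phrases` are hypothetical, "preceding word is not a barrier word" is read as "the preceding word was kept in the current phrase", and stripping articles or prepositions left dangling at the end of a phrase is an added assumption not stated in the text.

```python
import re

# Hypothetical miniature lists for illustration only.
BARRIER_WORDS = {"the", "a", "an", "of", "in", "is", "was", "consists", "found"}
CONNECTORS = {"the", "a", "an", "of", "in"}  # articles and prepositions

def refined_phrases(sentence, barrier_words=BARRIER_WORDS, connectors=CONNECTORS):
    """Barrier word parsing that lets articles and prepositions sit mid-phrase."""
    phrases, current = [], []

    def flush():
        # Assumption: connectors only belong in the middle of a phrase,
        # so any left dangling at the end are dropped.
        while current and current[-1].lower() in connectors:
            current.pop()
        if current:
            phrases.append(" ".join(current))
        current.clear()

    for token in re.findall(r"[A-Za-z0-9'-]+", sentence):
        word = token.lower()
        is_barrier = word in barrier_words
        # `current` is non-empty exactly when the preceding word was kept,
        # i.e. was not treated as a barrier; in that case keep the connector.
        if is_barrier and word in connectors and current:
            is_barrier = False
        if is_barrier:
            flush()
        else:
            current.append(token)
    flush()
    return phrases

print(refined_phrases("Carcinoma of the pancreas was found."))
print(refined_phrases("The gland consists of the pancreas."))
```

In the first sentence "of" and "the" follow a kept word and so stay inside the phrase. In the second they follow the barrier word "consists", remain barriers, and the words on either side are returned as separate phrases.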

The noun phrases identified using the barrier word method are then processed by the Approximate Matching algorithm, a version of MetaMap's browse mode created for use in NLM's Large Scale Vocabulary Test. Approximate matching is much less strict than normal (semantic) MetaMap processing. In effect, it casts a wider net to locate Metathesaurus concepts that would be missed by MetaMap's semantic mode. This produces more concepts at the risk of including ones not closely related to the input noun phrase.