PhraseX and the SPECIALIST Minimal Commitment Parser

Indexing Initiative
PhraseX is a program that extracts noun phrase strings from text. It does so by referring to the syntactic structure provided by the SPECIALIST minimal commitment parser, which relies on the SPECIALIST Lexicon as well as the Xerox stochastic tagger (Cutting et al. 1992). The output produced is in the tradition of partial parsing (Hindle 1983, McDonald 1992, Weischedel et al. 1993) and concentrates on the simple noun phrase--what Weischedel et al. (1993) call the "core noun phrase," that is, a noun phrase with no modification to the right of the head. Several approaches provide similar output based on statistics (Church 1988, Zhai 1997, for example), a finite-state machine (Ait-Mokhtar and Chanod 1997), or a hybrid approach combining statistics and linguistic rules (Voutilainen and Padro 1997).

The SPECIALIST parser is based on the notion of barrier words (Tersmette et al. 1988), which indicate boundaries between phrases. After lexical look-up and resolution of category label ambiguity by the tagger, complementizers, conjunctions, modals, prepositions, and verbs are marked as boundaries. Subsequently, boundaries are considered to open a new phrase (and close the preceding phrase). Any phrase containing a noun is considered to be a (simple) noun phrase, and in such a phrase, the right-most noun is labeled as the head; all other items (other than determiners) are labeled as modifiers. An example of the output from the SPECIALIST parser is given in (2) for the input in (1).
(1) Kupffer cells from halothane-exposed guinea pigs carry 
    trifluoroacetylated protein adducts.

(2)

[
   [
     mod([lexmatch(['Kupffer']),inputmatch(['Kupffer']),tag(noun)]),
     head([lexmatch([cells]),inputmatch([cells]),tag(noun)]) ],
   [
     prep([lexmatch([from]),inputmatch([from]),tag(prep)]),
     mod([lexmatch([halothane]),inputmatch([halothane]),tag(noun)]),
     punc([inputmatch([-])]),
     mod([lexmatch([exposed]),inputmatch([exposed]),tag(adj)]),
     head([lexmatch(['guinea pigs']),inputmatch([guinea,pigs]),
                                                      tag(noun)]) ],
   [
     verb([lexmatch([carry]),inputmatch([carry]),tag(verb)]) ],
   [
     mod([lexmatch([trifluoroacetylated]),
                       inputmatch([trifluoroacetylated]),tag(adj)]),
     mod([lexmatch([protein]),inputmatch([protein]),tag(noun)]),
     head([lexmatch([adducts]),inputmatch([adducts]),tag(noun)]),
     punc([inputmatch(['.'])]) ]
]
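
The barrier-word segmentation described above can be illustrated with a minimal Python sketch. It is not the SPECIALIST implementation: the function names `chunk` and `label` and the tag spellings are hypothetical, the input is assumed to be already tagged, and multi-word lexical items such as "guinea pigs" are assumed to arrive as single units. One further assumption is inferred from example (2) rather than stated in the text: a verb closes its own phrase immediately, whereas the other barrier categories stay attached to the material that follows them.

```python
# Categories that the text lists as phrase boundaries ("barrier words").
BARRIER_TAGS = {"compl", "conj", "modal", "prep", "verb"}
# Items that keep their own label instead of becoming "mod" (as in example (2)).
KEEP_LABEL = BARRIER_TAGS | {"det", "punc"}


def chunk(tagged):
    """Split (token, tag) pairs into phrases; each barrier opens a new phrase."""
    phrases, current = [], []
    for token, tag in tagged:
        if tag in BARRIER_TAGS and current:
            phrases.append(current)  # the barrier closes the preceding phrase
            current = []
        current.append((token, tag))
        if tag == "verb":
            # Assumption taken from example (2): a verb stands alone.
            phrases.append(current)
            current = []
    if current:
        phrases.append(current)
    return phrases


def label(phrase):
    """Label the right-most noun as head and other content items as mod."""
    noun_positions = [i for i, (_, tag) in enumerate(phrase) if tag == "noun"]
    head_at = noun_positions[-1] if noun_positions else None
    labeled = []
    for i, (token, tag) in enumerate(phrase):
        if i == head_at:
            role = "head"
        elif head_at is None or tag in KEEP_LABEL:
            role = tag  # no noun in the phrase, or a barrier/det/punc item
        else:
            role = "mod"
        labeled.append((role, token, tag))
    return labeled


# Sentence (1), hand-tagged to match the tags shown in (2).
SENTENCE = [
    ("Kupffer", "noun"), ("cells", "noun"), ("from", "prep"),
    ("halothane", "noun"), ("-", "punc"), ("exposed", "adj"),
    ("guinea pigs", "noun"), ("carry", "verb"),
    ("trifluoroacetylated", "adj"), ("protein", "noun"),
    ("adducts", "noun"), (".", "punc"),
]

parsed = [label(phrase) for phrase in chunk(SENTENCE)]
```

Under these assumptions, `parsed` holds four phrases whose role sequences mirror those in (2); lexical look-up and tagging, which the real parser performs first, are outside the scope of the sketch.
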
The underspecified structure produced by the SPECIALIST parser serves as the basis for the extraction of noun phrase strings by PhraseX. In addition to the simple noun phrase (labeled "simp" in the output), PhraseX identifies two further structures. The first is the complex noun phrase ("macro"), in which a head is followed by contiguous prepositional phrases to its right; the first preposition in this structure can be any preposition, but all subsequent ones must be "of". The second ("mega") is not a canonical syntactic phenomenon, but may be important for information processing: it includes all the content words that occur in a sentence either to the left or to the right of a finite verb. Examples of these strings as extracted from the syntactic structure in (2) are given in (3).
(3) 00000000|simp|kupffer cells 
    00000000|simp|halothane exposed guinea pigs 
    00000000|simp|trifluoroacetylated protein adducts 
    00000000|macro|kupffer cells from halothane exposed guinea pigs 
    00000000|mega|kupffer cells from halothane exposed guinea pigs 
    00000000|mega|trifluoroacetylated protein adducts 
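
The three string types in (3) can be derived from the structure in (2) along the following lines. This is a hedged sketch, not the PhraseX code: the function names and the `(role, word)` representation are hypothetical, every verb-tagged item is treated as a finite verb, and only the roles attested in the example (mod, head, prep) are kept in the output strings. Whether PhraseX also starts a "macro" string at a noun phrase that is itself inside a prepositional phrase is not stated in the text; the sketch allows it. The leading field of each line in (3) is not described in the text and is omitted here.

```python
# Hand-copied from structure (2): one list per phrase, items are (role, word).
PARSE = [
    [("mod", "Kupffer"), ("head", "cells")],
    [("prep", "from"), ("mod", "halothane"), ("punc", "-"),
     ("mod", "exposed"), ("head", "guinea pigs")],
    [("verb", "carry")],
    [("mod", "trifluoroacetylated"), ("mod", "protein"),
     ("head", "adducts"), ("punc", ".")],
]


def words(phrase, roles):
    """Lower-cased words of a phrase whose role is in `roles`."""
    return [word.lower() for role, word in phrase if role in roles]


def is_np(phrase):
    return any(role == "head" for role, _ in phrase)


def leading_prep(phrase):
    """The phrase-initial preposition, or None if there is none."""
    return phrase[0][1].lower() if phrase and phrase[0][0] == "prep" else None


def simp(parse):
    # Simple noun phrase: modifiers plus head, nothing to the right of the head.
    return [" ".join(words(p, {"mod", "head"})) for p in parse if is_np(p)]


def macro(parse):
    # Head followed by contiguous PPs: first preposition free, the rest "of".
    out = []
    for i, start in enumerate(parse):
        if not is_np(start):
            continue
        parts, attached = words(start, {"mod", "head"}), 0
        for nxt in parse[i + 1:]:
            prep = leading_prep(nxt)
            if prep is None or not is_np(nxt):
                break
            if attached >= 1 and prep != "of":
                break
            parts += words(nxt, {"prep", "mod", "head"})
            attached += 1
        if attached:
            out.append(" ".join(parts))
    return out


def mega(parse):
    # Everything on one side of a verb, joined into a single string.
    out, side = [], []
    for p in parse:
        if any(role == "verb" for role, _ in p):
            if side:
                out.append(" ".join(side))
            side = []
        else:
            side += words(p, {"prep", "mod", "head"})
    if side:
        out.append(" ".join(side))
    return out


def phrasex(parse):
    return ([("simp", s) for s in simp(parse)]
            + [("macro", s) for s in macro(parse)]
            + [("mega", s) for s in mega(parse)])
```

Calling `phrasex(PARSE)` yields the six label and string pairs listed in (3), in the same order.
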

References

Ait-Mokhtar, Salah and Jean-Pierre Chanod. (1997). Incremental finite-state parsing.
     Proceedings of the Fifth Conference on Applied Natural Language Processing, 72-79.

Church, Kenneth W. (1988). A stochastic parts program and noun phrase parser for unrestricted
     text. Proceedings of the Second Conference on Applied Natural Language Processing,
     136-143.

Cutting, Douglas R., J. Kupiec, Jan O. Pedersen, and P. Sibun. (1992). A practical part-of-speech
     tagger. Proceedings of the Third Conference on Applied Natural Language
     Processing.

Hindle, Donald. (1983). Deterministic parsing of syntactic non-fluencies. Proceedings of the 21st
     Annual Meeting of the Association for Computational Linguistics, 123-128.

McDonald, David D. (1992). Robust partial parsing through incremental, multi-algorithm processing.
     In Paul S. Jacobs (ed.) Text-Based Intelligent Systems, 83-99.

Tersmette KWF, Scott AF, Moore GW, Matheson NW, and Miller RE. (1988). Barrier word
     method for detecting molecular biology multiple word terms. In Greenes RA (ed.) Proceedings
     of the 12th Annual Symposium on Computer Applications in Medical Care, 207-211.

Voutilainen, Atro and Lluis Padro. (1997). Developing a hybrid NP parser. Proceedings of the
     Fifth Conference on Applied Natural Language Processing, 80-87.

Weischedel, Ralph, Marie Meteer, Richard Schwartz, Lance Ramshaw, and Jeff Palmucci.
     (1993). Coping with ambiguity and unknown words through probabilistic models.
     Computational Linguistics 19(2):359-382.

Zhai, Chengxiang. (1997). Fast statistical parsing of noun phrases for document indexing. Proceedings
     of the Fifth Conference on Applied Natural Language Processing, 312-319.

Last Modified: May 30, 2019