PhraseX and the SPECIALIST Minimal Commitment Parser
PhraseX is a program that extracts noun phrase strings from text. It
does so by referring to the syntactic structure provided by the
SPECIALIST minimal commitment parser, which relies on the SPECIALIST
Lexicon as well as the Xerox stochastic tagger (Cutting et al. 1992).
The output produced is in the tradition of partial parsing (Hindle
1983, McDonald 1992, Weischedel et al. 1993) and concentrates on the
simple noun phrase--what Weischedel et al. (1993) call the "core noun
phrase," that is, a noun phrase with no modification to the right of
the head. Several approaches provide similar output based on
statistics (Church 1988, Zhai 1997, for example), a finite-state
machine (Ait-Mokhtar and Chanod 1997), or a hybrid approach combining
statistics and linguistic rules (Voutilainen and Padro 1997).
The SPECIALIST parser is based on the notion of barrier words
(Tersmette et al. 1988), which indicate boundaries between phrases.
After lexical look-up and resolution of category label ambiguity by
the tagger, complementizers, conjunctions, modals, prepositions, and
verbs are marked as boundaries. Subsequently, boundaries are
considered to open a new phrase (and close the preceding phrase). Any
phrase containing a noun is considered to be a (simple) noun phrase,
and in such a phrase, the right-most noun is labeled as the head; all
other items (other than determiners) are labeled as modifiers. An
example of the output from the SPECIALIST parser is given in (2) for
the input in (1).
(1) Kupffer cells from halothane-exposed guinea pigs carry
trifluoroacetylated protein adducts.
(2)
[
  [
    mod([lexmatch(['Kupffer']),inputmatch(['Kupffer']),tag(noun)]),
    head([lexmatch([cells]),inputmatch([cells]),tag(noun)]) ],
  [
    prep([lexmatch([from]),inputmatch([from]),tag(prep)]),
    mod([lexmatch([halothane]),inputmatch([halothane]),tag(noun)]),
    punc([inputmatch([-])]),
    mod([lexmatch([exposed]),inputmatch([exposed]),tag(adj)]),
    head([lexmatch(['guinea pigs']),inputmatch([guinea,pigs]),
         tag(noun)]) ],
  [
    verb([lexmatch([carry]),inputmatch([carry]),tag(verb)]) ],
  [
    mod([lexmatch([trifluoroacetylated]),
         inputmatch([trifluoroacetylated]),tag(adj)]),
    mod([lexmatch([protein]),inputmatch([protein]),tag(noun)]),
    head([lexmatch([adducts]),inputmatch([adducts]),tag(noun)]),
    punc([inputmatch(['.'])]) ]
]
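The barrier-word procedure described above can be sketched in a few lines
of Python. This is a minimal illustration, not the SPECIALIST parser
itself: the names BOUNDARY_TAGS, chunk, and label are hypothetical, the
input is assumed to be already tokenized and tagged (with "guinea pigs"
as a single lexical item), and the tag names "compl", "conj", "modal",
"det", and "punc" are assumed spellings. The rule that a verb closes its
own phrase immediately is an assumption inferred from (2), where "carry"
stands alone; the text does not state it.

```python
# Tags treated as barrier words (phrase boundaries), following the text.
# The exact tag spellings are assumptions made for this sketch.
BOUNDARY_TAGS = {"compl", "conj", "modal", "prep", "verb"}


def chunk(tagged):
    """Split (token, tag) pairs into phrases at barrier words."""
    phrases, current = [], []
    for token, tag in tagged:
        # A barrier word closes the preceding phrase and opens a new one.
        if tag in BOUNDARY_TAGS and current:
            phrases.append(current)
            current = []
        current.append((token, tag))
        # Assumption inferred from (2): a verb forms a phrase by itself.
        if tag == "verb":
            phrases.append(current)
            current = []
    if current:
        phrases.append(current)
    return phrases


def label(phrase):
    """Label the right-most noun as head and the other items as modifiers."""
    noun_positions = [i for i, (_, tag) in enumerate(phrase) if tag == "noun"]
    head_index = noun_positions[-1] if noun_positions else None
    labeled = []
    for i, (token, tag) in enumerate(phrase):
        if i == head_index:
            role = "head"
        elif tag in BOUNDARY_TAGS or tag in {"det", "punc"}:
            role = tag  # barrier words, determiners, punctuation keep their tag
        elif head_index is not None:
            role = "mod"  # any other item in a phrase that contains a noun
        else:
            role = tag  # phrase without a noun: nothing to modify
        labeled.append((role, token, tag))
    return labeled


# The sentence in (1), with tags taken from the output in (2).
SENTENCE = [
    ("Kupffer", "noun"), ("cells", "noun"), ("from", "prep"),
    ("halothane", "noun"), ("-", "punc"), ("exposed", "adj"),
    ("guinea pigs", "noun"), ("carry", "verb"),
    ("trifluoroacetylated", "adj"), ("protein", "noun"),
    ("adducts", "noun"), (".", "punc"),
]

for phrase in chunk(SENTENCE):
    print(label(phrase))
```

Run on the sentence in (1), the sketch yields four phrases whose labels
follow the same mod/head/prep/verb/punc pattern as the structure in (2).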
The underspecified structure produced by the SPECIALIST parser serves
as the basis for the extraction of noun phrase strings by PhraseX. In
addition to the simple noun phrase (labeled as "simp" in output),
PhraseX identifies two additional structures. One of these is the
complex noun phrase in which a head is followed by contiguous
prepositional phrases to its right ("macro"). The first preposition in
this structure can be anything, but all the rest must be "of". The
second structure is not a canonical syntactic phenomenon, but may be
important for information processing. Such a phrase includes all the
content words that occur in a sentence either to the left or the right
of a finite verb ("mega"). Examples of these strings as extracted from
the syntactic structure in (2) are given in (3).
(3) 00000000|simp|kupffer cells
    00000000|simp|halothane exposed guinea pigs
    00000000|simp|trifluoroacetylated protein adducts
    00000000|macro|kupffer cells from halothane exposed guinea pigs
    00000000|mega|kupffer cells from halothane exposed guinea pigs
    00000000|mega|trifluoroacetylated protein adducts
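The three string types can be illustrated with a short Python sketch that
works from a hand-written copy of the structure in (2). It is only a
reading of the description above, not the PhraseX code: the names PARSE,
run_string, and extract are hypothetical, and several choices are
assumptions, namely that strings are lowercased with punctuation dropped
(as in (3)), that only the longest "macro" string is emitted, and that
any phrase labeled verb counts as the finite verb for "mega". The leading
field shown in (3) is not reproduced because its meaning is not described
here.

```python
# The structure in (2), reduced to (role, token) pairs per phrase.
PARSE = [
    [("mod", "Kupffer"), ("head", "cells")],
    [("prep", "from"), ("mod", "halothane"), ("punc", "-"),
     ("mod", "exposed"), ("head", "guinea pigs")],
    [("verb", "carry")],
    [("mod", "trifluoroacetylated"), ("mod", "protein"),
     ("head", "adducts"), ("punc", ".")],
]


def has_head(phrase):
    return any(role == "head" for role, _ in phrase)


def first_prep(phrase):
    """Return the phrase-initial preposition (lowercased), or None."""
    if phrase and phrase[0][0] == "prep":
        return phrase[0][1].lower()
    return None


def run_string(phrases):
    """Join a run of phrases into one lowercased string.

    Modifiers and heads are kept; prepositions are kept except in the
    first phrase of the run (assumption based on the strings in (3)).
    """
    out = []
    for n, phrase in enumerate(phrases):
        kept = {"mod", "head", "prep"} if n > 0 else {"mod", "head"}
        out.extend(tok.lower() for role, tok in phrase if role in kept)
    return " ".join(out)


def extract(parse):
    results = []
    # simp: every phrase that has a head.
    for phrase in parse:
        if has_head(phrase):
            results.append(("simp", run_string([phrase])))
    # macro: a noun phrase plus contiguous prepositional phrases; the
    # first preposition is free, every later one must be "of".
    i = 0
    while i < len(parse):
        j = i
        if has_head(parse[i]):
            j = i + 1
            while j < len(parse) and has_head(parse[j]):
                prep = first_prep(parse[j])
                if prep is None or (j > i + 1 and prep != "of"):
                    break
                j += 1
            if j - i > 1:
                results.append(("macro", run_string(parse[i:j])))
        i = max(j, i + 1)
    # mega: everything on one side of a verb phrase. A sentinel verb
    # phrase flushes the final segment.
    segment = []
    for phrase in parse + [[("verb", "")]]:
        if any(role == "verb" for role, _ in phrase):
            if any(has_head(p) for p in segment):
                results.append(("mega", run_string(segment)))
            segment = []
        else:
            segment.append(phrase)
    return results


for kind, text in extract(PARSE):
    print(f"{kind}|{text}")
```

For the structure in (2) this produces the same six type and string
pairs listed in (3), in the same order.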
References
Ait-Mokhtar, Salah and Jean-Pierre Chanod. (1997). Incremental finite-state parsing.
Proceedings of the Fifth Conference on Applied Natural Language Processing, 72-79.
Church, Kenneth W. (1988). A stochastic parts program and noun phrase parser for unrestricted
text. Proceedings of the Second Conference on Applied Natural Language Processing,
136-143.
Cutting, Douglas R., J. Kupiec, Jan O. Pedersen, and P. Sibun. (1992). A practical part-of-speech
tagger. Proceedings of the Third Conference on Applied Natural Language
Processing.
Hindle, Donald. (1983). Deterministic parsing of syntactic non-fluencies. Proceedings of the 21st
Annual Meeting of the Association for Computational Linguistics, 123-128.
McDonald, David D. (1992). Robust partial parsing through incremental, multi-algorithm processing.
In Paul S. Jacobs (ed.) Text-Based Intelligent Systems, 83-99.
Tersmette KWF, Scott AF, Moore GW, Matheson NW, and Miller RE. (1988). Barrier word
method for detecting molecular biology multiple word terms. In Greenes RA (ed.) Proceedings
of the 12th Annual Symposium on Computer Applications in Medical Care, 207-211.
Voutilainen, Atro and Lluis Padro. (1997). Developing a hybrid NP parser. Proceedings of the
Fifth Conference on Applied Natural Language Processing, 80-87.
Weischedel, Ralph, Marie Meteer, Richard Schwartz, Lance Ramshaw, and Jeff Palmucci.
(1993). Coping with ambiguity and unknown words through probabilistic models.
Computational Linguistics 19(2):359-382.
Zhai, Chengxiang. (1997). Fast statistical parsing of noun phrases for document indexing. Proceedings
of the Fifth Conference on Applied Natural Language Processing, 312-319.