Journal of the American Society for Information Science and Technology

Early View   (Articles online in advance of print)

Published Online: 3 Nov 2005

Copyright © 2005 Wiley Periodicals, Inc., A Wiley Company

 Research Article
Word sense disambiguation by selecting the best semantic type based on Journal Descriptor Indexing: Preliminary experiment
Susanne M. Humphrey, Willie J. Rogers, Halil Kilicoglu, Dina Demner-Fushman, Thomas C. Rindflesch
Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, MD 20894
email: Susanne M. Humphrey (humphrey@nlm.nih.gov) Willie J. Rogers (wrogers@nlm.nih.gov) Halil Kilicoglu (halil@nlm.nih.gov) Dina Demner-Fushman (dina_demner@nlm.nih.gov) Thomas C. Rindflesch (tcr@nlm.nih.gov)

Abstract
An experiment was performed at the National Library of Medicine® (NLM®) in word sense disambiguation (WSD) using the Journal Descriptor Indexing (JDI) methodology. The motivation is the need to solve the ambiguity problem confronting NLM's MetaMap system, which maps free text to terms corresponding to concepts in NLM's Unified Medical Language System® (UMLS®) Metathesaurus®. If the text maps to more than one Metathesaurus concept at the same high confidence score, MetaMap has no way of knowing which concept is the correct mapping. We describe the JDI methodology, which is ultimately based on statistical associations between words in a training set of MEDLINE® citations and a small set of journal descriptors (assigned by humans to journals per se) assumed to be inherited by the citations. JDI is the basis for selecting the best meaning that is correlated to UMLS semantic types (STs) assigned to ambiguous concepts in the Metathesaurus. For example, the ambiguity transport has two meanings: Biological Transport assigned the ST Cell Function and Patient transport assigned the ST Health Care Activity. A JDI-based methodology can analyze text containing transport and determine which ST receives a higher score for that text, which then returns the associated meaning, presumed to apply to the ambiguity itself. We then present an experiment in which a baseline disambiguation method was compared to four versions of JDI in disambiguating 45 ambiguous strings from NLM's WSD Test Collection. Overall average precision for the highest-scoring JDI version was 0.7873 compared to 0.2492 for the baseline method, and average precision for individual ambiguities was greater than 0.90 for 23 of them (51%), greater than 0.85 for 24 (53%), and greater than 0.65 for 35 (78%). On the basis of these results, we hope to improve performance of JDI and test its use in applications.


Received: 26 July 2004; Revised: 1 October 2004; Accepted: 10 November 2004

Digital Object Identifier (DOI)

10.1002/asi.20257


Introduction and Background

The objective of NLM's Indexing Initiative (National Library of Medicine, [2004a]) is to investigate ways in which automatic indexing methods can partially or completely substitute for current indexing practices (Aronson et al., [2000]). The prototype indexing system developed under this initiative eventually became the Medical Text Indexer (MTI) (Aronson, Mork, Gay, Humphrey, & Rogers, [2004]), which now actively participates in MEDLINE indexing using terms from NLM's Medical Subject Headings (MeSH®) thesaurus (National Library of Medicine, [2004b]). MTI indexes about 3,700 citations a day 5 nights a week. Indexers accept the option of viewing the resulting MTI recommendations about 379 times per day, including weekends. It is estimated that MTI recommendations are accessed by indexers during the indexing of 20% of MEDLINE articles. MTI has also been used as the sole indexing method for about 79,000 meeting abstracts on human immunodeficiency virus/acquired immunodeficiency syndrome (HIV/AIDS), health services research, and space life sciences.

MTI has as a major component the MetaMap program (Aronson, [2001]), which maps biomedical text to concepts in the UMLS Metathesaurus (National Library of Medicine, [2004c]). MetaMap is a knowledge-based method that relies on the SPECIALIST Lexicon (a component of the UMLS) and an underspecified syntactic parser to identify noun phrases in biomedical text. The best match between a noun phrase and a Metathesaurus concept is computed by accommodating lexical variation in the input phrase and allowing partial matches between the phrase and concept. A confidence score is assigned to each mapping to reflect the closeness of match of the input noun phrase to the target Metathesaurus concept. For example, the phrase between the blastocyst trophectoderm in the following sentence from a MEDLINE abstract:

s1 In the mouse, the process of implantation is initiated by the attachment reaction between the blastocyst trophectoderm and uterine luminal epithelium that occurs at 2200-2300 h on day 4 (day 1 = vaginal plug) of pregnancy.
maps to only one Metathesaurus concept:
694 Blastocyst [Embryonic Structure]

The confidence score, 694 out of 1,000, and UMLS semantic type (ST) for the concept, Embryonic Structure, are provided as output. Semantic types are a set of 135 labels in the UMLS Semantic Network for concept categories in the biomedical domain, e.g., Disease or Syndrome, Therapeutic or Preventive Procedure, Body Substance, and Pharmacologic Substance. Metathesaurus concepts are assigned one or more STs, each of which forms an isa link from the concept to the ST; in this example, Blastocyst is an Embryonic Structure.

However, the phrase of implantation maps to two Metathesaurus concepts, both with confidence scores of 1,000:

   1000 Implantation <1> (Blastocyst Implantation, natural) [Organism Function]
   1000 Implantation <2> (Implantation procedure) [Therapeutic or Preventive Procedure]

This result illustrates the problem of ambiguous mappings. Although Blastocyst Implantation, natural is the correct mapping, MetaMap has no way of choosing which of these concepts represents the meaning of this input phrase. This phenomenon is caused by word sense ambiguity in English, and currently MetaMap does not choose between ambiguous mappings. Because MetaMap is the core component of MTI, automatic indexing of MEDLINE will be enhanced by providing a method for resolving this kind of ambiguity.
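The tie described above can be illustrated with a minimal Python sketch. It does not use MetaMap's actual interface: the `Mapping` record and the functions `top_mappings` and `is_ambiguous` are hypothetical names introduced here, and the candidate lists simply restate the two example mappings from the text.

```python
from typing import List, NamedTuple


class Mapping(NamedTuple):
    """Hypothetical record for one candidate mapping (not MetaMap's own format)."""
    score: int          # confidence score out of 1,000
    concept: str        # Metathesaurus concept name
    semantic_type: str  # UMLS semantic type of the concept


def top_mappings(candidates: List[Mapping]) -> List[Mapping]:
    """Return every candidate tied at the highest confidence score."""
    if not candidates:
        return []
    best = max(m.score for m in candidates)
    return [m for m in candidates if m.score == best]


def is_ambiguous(candidates: List[Mapping]) -> bool:
    """A phrase is ambiguous when more than one concept shares the top score."""
    return len(top_mappings(candidates)) > 1


# The two example phrases from sentence s1.
blastocyst = [Mapping(694, "Blastocyst", "Embryonic Structure")]
implantation = [
    Mapping(1000, "Blastocyst Implantation, natural", "Organism Function"),
    Mapping(1000, "Implantation procedure", "Therapeutic or Preventive Procedure"),
]

print(is_ambiguous(blastocyst))    # single mapping, no tie
print(is_ambiguous(implantation))  # two mappings tied at 1000
```

A confidence score alone cannot break such a tie, which is why an additional signal, such as the ST ranking of the surrounding text, is needed.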

The extent of the ambiguity problem was shown in an experiment conducted in connection with developing NLM's Word Sense Disambiguation (WSD) test collection (Weeber, Mork, & Aronson, [2001]) whereby 409,337 MEDLINE citations indexed in 1998 were run through MetaMap, resulting in more than 34 million phrases. About 4 million phrases (11.7%) had more than one mapping to Metathesaurus concepts; 94% of these cases were ambiguities in which an exact string mapped to more than one concept. These sorts of ambiguity became the focus of developing the WSD test collection.

The purpose of the WSD test collection was to establish a testbed of humanly disambiguated instances to serve as a gold standard for evaluating automatic disambiguation methods. From the list of ambiguous strings from the processed phrases, 50 highly frequent ones were selected at random from the entire 1998 MEDLINE database. Appendix A shows all 50 ambiguities in the test collection with their respective Metathesaurus concepts and ST abbreviations. For example, the ambiguity transport maps to two concepts, Biological Transport with ST celf (abbreviation for Cell Function) and Patient transport with ST hlca (abbreviation for Health Care Activity). From now on we use abbreviated forms for the few STs mentioned in the text of this article; their full forms can be found in Appendix B, which lists the 44 ST abbreviations and full forms represented in the test collection. Appendix C gives a hierarchical view of these STs.

For each ambiguity, 100 instances (sentences containing the ambiguity) were selected. Thus, there were 5,000 instances to be disambiguated by human raters. A Web-based interface was developed to facilitate the human disambiguation procedure, showing the citation with the highlighted sentence containing the ambiguous string to be considered. The actual manual task was reduced to two mouse clicks for each instance: selecting one and only one sense or passing for the time being. Figure 1 shows the result of the eight raters' choices for disambiguating s1, unanimously in favor of Blastocyst Implantation, natural (having ST orgf).

Figure 1. Result of choices of eight raters who used the WSD interface to disambiguate s1, unanimously selecting Blastocyst Implantation, natural (having ST orgf).

NLM is investigating Journal Descriptor Indexing (JDI), a novel approach to fully automatic indexing based on NLM's practice of maintaining a subject index to journal titles using journal descriptors (JDs), which are terms corresponding to biomedical specialties (Humphrey, [1998], [1999]). JDI methodology has been extended to ST indexing (Humphrey, Rindflesch, & Aronson, [2000]); both are described in the next section. Using the preceding example, s1 can be indexed automatically by ST, where each ST is ranked with a score from 0 to 1 (Table 1). In this indexing, orgf (Organism Function) ranks higher than topp (Therapeutic or Preventive Procedure), thus indicating that Blastocyst Implantation, natural (having ST orgf) is a better meaning for the sentence than Implantation procedure (having ST topp), and therefore the better meaning for the ambiguous string implantation in this sentence, consistent with the human raters (Figure 1).

 
Table 1. ST indexing of s1 In the mouse, the process of implantation is initiated by the attachment reaction between the blastocyst trophectoderm and uterine luminal epithelium that occurs at 2200-2300 h on day 4 (day 1 = vaginal plug) of pregnancy.

Rank  ST abbr  Semantic Type                        Score
1     orgf     Organism Function                    0.5897
14    spco     Spatial Concept                      0.4841
15    diap     Diagnostic Procedure                 0.4831
18    topp     Therapeutic or Preventive Procedure  0.4591
25    emst     Embryonic Structure                  0.4301
41    aapp     Amino Acid, Peptide, or Protein      0.3724
104   vtbt     Vertebrate                           0.2210

On the other hand, as seen in Figure 2, human raters unanimously selected Implantation procedure (having ST topp) for disambiguating the following sentence with the same ambiguous string implantation:

s2 We conclude that artificial sphincter implantation is safe, reliable, and very effective in treating incontinence caused by sphincteric dysfunction in properly selected patients.

Figure 2. Result of choices of eight raters who used the WSD interface to disambiguate s2, unanimously selecting Implantation procedure (having ST topp).

ST indexing of s2 ranks topp higher than orgf (Table 2), thus indicating that Implantation procedure (having ST topp) is the better meaning for the sentence, and therefore for the ambiguous string implantation in that sentence, again consistent with the human raters (Figure 2).

 
Table 2. ST indexing of s2 We conclude that artificial sphincter implantation is safe, reliable and very effective in treating incontinence due to sphincteric dysfunction in properly selected patients.

Rank  ST abbr  Semantic Type                        Score
1     diap     Diagnostic Procedure                 0.6238
2     topp     Therapeutic or Preventive Procedure  0.6098
3     spco     Spatial Concept                      0.5627
9     orgf     Organism Function                    0.4797
59    aapp     Amino Acid, Peptide, or Protein      0.2739
85    emst     Embryonic Structure                  0.2181
119   vtbt     Vertebrate                           0.1349

This article describes experiments in applying JDI-based methodology to the WSD problem using the WSD Test Collection. This methodology will be explained in the next section.

Methodology of JDI-Based ST Indexing

Ultimately, JDI relies on ST indexing of some context in which the ambiguous string appears, as illustrated in the previous section, where the context is the sentences containing implantation. If a sentence can be indexed by a ranked list of STs, and the ambiguous string in the sentence can be mapped to two possible concepts, which have different STs assigned to them, then the higher-ranked ST and its corresponding concept win as representing the meaning of the string. In other words, whichever ST ranks higher for the context of the ambiguity is considered the better of the two STs for the ambiguity itself; once the better ST is chosen, the corresponding concept is also chosen.
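The selection rule just described can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation: the function name `select_sense`, the dictionary layout, and the choice to treat a missing ST as scoring 0.0 are all assumptions made here. The scores are the orgf and topp values reported in Tables 1 and 2.

```python
from typing import Dict


def select_sense(st_scores: Dict[str, float], candidates: Dict[str, str]) -> str:
    """Pick the candidate concept whose semantic type scores highest.

    st_scores  : ST abbreviation -> score from ST indexing of the context
    candidates : concept name -> ST abbreviation assigned to that concept
    Assumption: an ST absent from st_scores counts as 0.0.
    """
    return max(candidates, key=lambda concept: st_scores.get(candidates[concept], 0.0))


# Candidate concepts for the ambiguous string "implantation".
candidates = {
    "Blastocyst Implantation, natural": "orgf",
    "Implantation procedure": "topp",
}

# orgf and topp scores for sentences s1 and s2 (Tables 1 and 2).
s1_scores = {"orgf": 0.5897, "topp": 0.4591}
s2_scores = {"orgf": 0.4797, "topp": 0.6098}

print(select_sense(s1_scores, candidates))  # concept with ST orgf
print(select_sense(s2_scores, candidates))  # concept with ST topp
```

Only the relative order of the candidate STs matters here; how an exact tie should be resolved is not addressed by this sketch, which simply returns the first candidate encountered.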

The ST indexing used for the WSD application relies on a word-ST table whereby each word in a training set is associated with an ST vector consisting of 129 ST rankings, ordered alphabetically by ST abbreviation. The training set consists of titles and abstracts of 910,542 MEDLINE citations to articles from 3,993 journals indexed in 1999 and 2000, which contain 232,676 unique words (meeting certain criteria such as having at least three characters, beginning with an alphabetic character, and occurring at least twice in the training set). Use of the JDI methodology for generating the word-ST tables based on the training set is described later. However, informally, an ST vector describes the semantic context in which a word occurs.

For example, ST vectors for the words implantation, blastocyst, and sphincter are shown in Tables 3, 4, and 5, respectively. Note: rather than display all STs, we selected the first and last STs (aapp [Amino Acid, Peptide or Protein] and vtbt [Vertebrate]) alphabetically by ST abbreviation; the set of highest-ranking STs for each word (topp for implantation, emst [Embryonic Structure] for blastocyst, diap [Diagnostic Procedure] for sphincter); and the STs of interest for disambiguating implantation (orgf; topp) shown in boldface. High-ranking STs in these examples reflect the semantic contexts in which the words commonly occur, which have a significant impact on word sense disambiguation. Blastocyst, for example, most often occurs in text describing organism function, as seen by the high rank of the corresponding ST in Table 4. Sphincter, on the other hand, is more often associated with procedures (high rank of topp in Table 5). The two semantic types orgf and topp both have relatively high rank in the ST vector for implantation (Table 3), which commonly occurs in both environments. As described subsequently, our methodology relies on computing semantic contexts for sentences containing ambiguous strings such as implantation by using precomputed semantic contexts of cooccurring words in the sentence such as blastocyst or sphincter.

 
Table 3. Items in ST vector for implantation.

Rank  ST abbr  Semantic Type                        Score
57    aapp     Amino Acid, Peptide, or Protein      0.3373
5     diap     Diagnostic Procedure                 0.6637
39    emst     Embryonic Structure                  0.4168
13    orgf     Organism Function                    0.6013
1     spco     Spatial Concept                      0.7027
2     topp     Therapeutic or Preventive Procedure  0.6937
108   vtbt     Vertebrate                           0.1748

 
Table 4. Items in ST vector for blastocyst.

Rank  ST abbr  Semantic Type                        Score
24    aapp     Amino Acid, Peptide, or Protein      0.2160
44    diap     Diagnostic Procedure                 0.1728
1     emst     Embryonic Structure                  0.6096
2     orgf     Organism Function                    0.4998
46    spco     Spatial Concept                      0.1654
45    topp     Therapeutic or Preventive Procedure  0.1695
41    vtbt     Vertebrate                           0.1780

 
Table 5. Items in ST vector for sphincter.

Rank  ST abbr  Semantic Type                        Score
66    aapp     Amino Acid, Peptide, or Protein      0.1638
1     diap     Diagnostic Procedure                 0.6746
100   emst     Embryonic Structure                  0.1068
21    orgf     Organism Function                    0.3584
3     spco     Spatial Concept                      0.5660
2     topp     Therapeutic or Preventive Procedure  0.6528
118   vtbt     Vertebrate                           0.0518

Knowing the ST scores for individual words, we now can compute a vector that is the centroid of the ST vectors for all words in some context, such as a phrase or sentence. The score for an ST in the centroid is the average of the scores for this ST across the words in the context. A display of STs in the centroid in rank order becomes the ranked ST indexing for the context. Table 6 shows ST indexing for the phrase blastocyst implantation where the ST scores are the average of the same ST scores for implantation (Table 3) and blastocyst (Table 4); e.g., (0.4998 [blastocyst orgf score] + 0.6013 [implantation orgf score]) ÷ 2 = 0.5506 [blastocyst implantation orgf score]; orgf is appropriately ranked higher than topp for the phrase. Similarly, Table 7 shows ST indexing for the phrase sphincter implantation where the ST scores are the average of the same ST scores for implantation (Table 3) and sphincter (Table 5); topp is appropriately ranked higher than orgf for the phrase.

 
Table 6. ST indexing of blastocyst implantation.

Rank  ST abbr  Semantic Type                        Score
1     orgf     Organism Function                    0.5506
4     emst     Embryonic Structure                  0.5132
12    spco     Spatial Concept                      0.4340
13    topp     Therapeutic or Preventive Procedure  0.4316
16    diap     Diagnostic Procedure                 0.4182
45    aapp     Amino Acid, Peptide, or Protein      0.2766
92    vtbt     Vertebrate                           0.1764

 
Table 7. ST indexing of sphincter implantation.

Rank  ST abbr  Semantic Type                        Score
1     topp     Therapeutic or Preventive Procedure  0.6732
2     diap     Diagnostic Procedure                 0.6692
3     spco     Spatial Concept                      0.6344
18    orgf     Organism Function                    0.4798
59    emst     Embryonic Structure                  0.2618
62    aapp     Amino Acid, Peptide, or Protein      0.2506
116   vtbt     Vertebrate                           0.1133
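The centroid computation behind Tables 6 and 7 can be reproduced with a short Python sketch. The names `ST_VECTORS`, `centroid`, and `ranked` are introduced here for illustration, and the vectors hold only the seven STs displayed in Tables 3 to 5 rather than all 129, so the rank positions produced by this toy example will not match the ranks printed in the tables (only the scores will, up to rounding of the published four-decimal values).

```python
from typing import Dict, List, Tuple

# Partial ST vectors copied from Tables 3, 4, and 5 (7 of the 129 STs).
ST_VECTORS: Dict[str, Dict[str, float]] = {
    "implantation": {"aapp": 0.3373, "diap": 0.6637, "emst": 0.4168, "orgf": 0.6013,
                     "spco": 0.7027, "topp": 0.6937, "vtbt": 0.1748},
    "blastocyst":   {"aapp": 0.2160, "diap": 0.1728, "emst": 0.6096, "orgf": 0.4998,
                     "spco": 0.1654, "topp": 0.1695, "vtbt": 0.1780},
    "sphincter":    {"aapp": 0.1638, "diap": 0.6746, "emst": 0.1068, "orgf": 0.3584,
                     "spco": 0.5660, "topp": 0.6528, "vtbt": 0.0518},
}


def centroid(words: List[str], table: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average the ST scores of all words in the context that appear in the table."""
    vectors = [table[w] for w in words if w in table]
    if not vectors:
        return {}
    return {st: sum(v[st] for v in vectors) / len(vectors) for st in vectors[0]}


def ranked(vector: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (ST, score) pairs sorted from highest to lowest score."""
    return sorted(vector.items(), key=lambda item: item[1], reverse=True)


for phrase in (["blastocyst", "implantation"], ["sphincter", "implantation"]):
    print(" ".join(phrase))
    for st, score in ranked(centroid(phrase, ST_VECTORS)):
        print(f"  {st}  {score:.4f}")
```

Words missing from the table are skipped in this sketch; that is a simplifying assumption rather than a documented rule.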

The same methodology is applied for computing ST scores for the sentences containing the ambiguous string implantation in order to select the better concept mapping according to relative scores of STs assigned to the concepts. In ST indexing of s1 (Table 1) the higher score for orgf (compared to topp) selects the Blastocyst Implantation concept, whereas in ST indexing of s2 (Table 2) the higher score for topp selects the Implantation procedure concept.

We will now describe the JDI methodology and the way it is used for generating word-ST tables used for ST indexing. JDI uses statistical associations between the words in the training set and 127 JDs that index the approximately 4,000 MEDLINE journals per se in terms of biomedical disciplines (National Library of Medicine, [2002]). Table 8 shows a sample journal record (Journal Identifier, Title, Title Abbreviation, Journal Descriptor) for Fertility and Sterility in NLM's journal (i.e., serial records) database.

 
Table 8. NLM journal record for Fertility and Sterility showing the JD Reproduction.

JID   0372772
TI    Fertility and Sterility
TA    Fertil Steril
JD    Reproduction

Table 9 shows a sample citation (PubMed Identifier, Title, Title Abbreviation, Journal Identifier, Source, Journal Descriptor) from the training set, including the JD Reproduction, which we mapped from the journal record. Thus, citations inherit JDs from journal records corresponding to the journals in which the documents are published. Each word in the sample title (Table 9) from the training set (including implantation, which we emphasize) can be said to cooccur with the JD Reproduction by virtue of this inheritance.

 
Table 9. Sample MEDLINE citation in the training set showing inheritance of JD from NLM journal record.

PMID   10856474
TI     Blastocyst score affects implantation and pregnancy outcome: toward a single blastocyst transfer.
JID    0372772
SO     Fertil Steril 2000 Jun;73(6):1155-8.
JD(a)  Reproduction

   (a) Mapped from the journal record for Fertility and Sterility (Table 8).

Because each citation in the training set inherits one or more JDs, an association between words and JDs can be represented as the number of cooccurrences of each word with each JD in the citations in the training set. The JD scores for implantation can be expressed by the ratio of the number of citations in which implantation cooccurs with the JD, divided by the total citation count for implantation. The 127 JD scores for implantation, ordered alphabetically by JD, form a JD vector. For example, part of the JD vector for implantation is shown in Table 10. Note: Rather than display all JDs, we selected the first and last JDs alphabetically (which, incidentally, never cooccur with implantation) and the five highest-ranking JDs.

 
Table 10. Items in JD vector for implantation.

Rank  Journal Descriptor                   Score
109   Acquired Immunodeficiency Syndrome   0.0000
4     Biomedical Engineering               0.4067
2     Cardiology                           0.6416
3     Ophthalmology                        0.6405
5     Otolaryngology                       0.3741
1     Reproduction                         0.9044
109   Zoology                              0.0000
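The cooccurrence ratio described above can be sketched as follows. The four toy citations and the function name `jd_vectors` are hypothetical, invented here for illustration. The sketch implements only the plain ratio stated in the text (citations in which a word cooccurs with a JD, divided by the word's total citation count); the published scores in Table 10 may reflect further processing that is not described in this section, so the numbers below are not expected to resemble them.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Citation = Tuple[List[str], List[str]]  # (words in title/abstract, inherited JDs)


def jd_vectors(citations: Iterable[Citation]) -> Dict[str, Dict[str, float]]:
    """For each word, score each JD as (citations with word and JD) / (citations with word)."""
    word_totals: Counter = Counter()
    cooccurrence: Dict[str, Counter] = defaultdict(Counter)
    for words, jds in citations:
        for word in set(words):          # count each word once per citation
            word_totals[word] += 1
            for jd in set(jds):
                cooccurrence[word][jd] += 1
    return {
        word: {jd: n / word_totals[word] for jd, n in cooccurrence[word].items()}
        for word in word_totals
    }


# Hypothetical toy training set; each citation inherits the JDs of its journal.
citations = [
    (["blastocyst", "implantation", "pregnancy"], ["Reproduction"]),
    (["sphincter", "implantation"], ["Urology"]),
    (["implantation", "embryo"], ["Reproduction", "Embryology"]),
    (["cochlear", "implantation"], ["Otolaryngology"]),
]

vectors = jd_vectors(citations)
print(vectors["implantation"])
```

JDs that never cooccur with a word are simply absent from its dictionary, which corresponds to a score of 0.0000 in the full 127-element vector.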

We therefore can assign JDs as indexing terms to some text on the basis of the words in it. Analogously to ST indexing that uses ST vectors, we perform JD indexing by computing a JD vector, which is the centroid of the JD vectors for the words in the text to be indexed. The score for a JD in the centroid is the average of the scores for this JD across the words. A display of JDs in the centroid in rank order becomes the ranked JD indexing for the text. Tables 11 and 12 show the first five JDs in the indexing of s1 and s2, respectively. The JD scores for each JD are the average of the scores for the same JD for words in the sentences. For example, for s1, the score for Reproduction is based on the average of the scores for Reproduction in the JD indexing of words taken from the sentence: implantation, attachment, blastocyst, uterine, luminal, epithelium, vaginal, plug, pregnancy (allowing for conditions to ignore certain words, such as membership in a stopwords list and nonoccurrence in the UMLS Metathesaurus). As shown in Table 11, the outstanding JD for s1 is Reproduction; in Table 12, the outstanding JD for s2 is Urology.

 
Table 11. JD indexing of s1 In the mouse, the process of implantation is initiated by the attachment reaction between the blastocyst trophectoderm and uterine luminal epithelium that occurs at 2200-2300 h on day 4 (day 1 = vaginal plug) of pregnancy.

Rank  Score   Journal Descriptor
1     0.1431  Reproduction
2     0.0747  Obstetrics
3     0.0735  Gynecology
4     0.0257  Embryology
5     0.0245  Veterinary Medicine

 
Table 12. JD indexing of s2 We conclude that artificial sphincter implantation is safe, reliable and very effective in treating incontinence due to sphincteric dysfunction in properly selected patients.

Rank  Score   Journal Descriptor
1     0.1857  Urology
2     0.0522  Gynecology
3     0.0504  Gastroenterology
4     0.0423  Obstetrics
5     0.0321  Reproduction

However, this JD indexing as such is not useful for WSD. What we need is ST indexing for selecting the best MetaMap concept mapping, as described earlier. We achieve this indexing by creating ST documents to undergo JD indexing, where an ST document is a set of Metathesaurus words highly associated with a particular ST. An ST document is created by automatically extracting one-word Metathesaurus strings belonging to concepts assigned the ST; this set of words constitutes the ST document. For example, the 2002 Metathesaurus contained 187 words in our orgf document (autoregulation, deglutition, healing, locomotion, urination, etc., where these words belonged to concepts assigned the ST Organism Function) and 1,478 words in our topp document (arthroplasty, bandaging, dissection, hemodialysis, immunization, etc., where these words belonged to concepts assigned the ST Therapeutic or Preventive Procedure). Part of the JD vector for the latter ST document is shown in Table 13, consisting of the five highest-ranking JDs and the first and last JDs alphabetically. We performed JD indexing of 129 ST documents (remaining STs did not have enough Metathesaurus words associated with them), resulting in a JD vector for each of them.

 
Table 13. Items in JD vector for topp (Therapeutic or Preventive Procedure) document (arthroplasty, bandaging, dissection, hemodialysis, immunization, etc.).

Rank  Journal Descriptor                   Score
83    Acquired Immunodeficiency Syndrome   0.0213
4     Ophthalmology                        0.3160
5     Orthopedics                          0.3070
1     Otolaryngology                       0.4827
3     Surgery                              0.4740
2     Urology                              0.4803
127   Zoology                              0.0000
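The construction of ST documents can be illustrated with a small Python sketch. The input format, a list of (string, ST abbreviation) pairs, is a simplification assumed here and is not the actual Metathesaurus file layout; the function name `build_st_documents` and the lowercasing of words are likewise assumptions. The example strings are taken from the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_st_documents(concept_strings: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group one-word strings by the ST of the concept they belong to.

    Multiword strings are skipped, following the description that only
    one-word Metathesaurus strings are extracted.
    """
    documents: Dict[str, Set[str]] = defaultdict(set)
    for string, st in concept_strings:
        tokens = string.split()
        if len(tokens) == 1:
            documents[st].add(tokens[0].lower())  # lowercasing is an assumption
    return dict(documents)


# Simplified (string, ST) pairs; strings come from the examples in the text.
concept_strings = [
    ("arthroplasty", "topp"),
    ("Hemodialysis", "topp"),
    ("deglutition", "orgf"),
    ("locomotion", "orgf"),
    ("Blastocyst Implantation, natural", "orgf"),  # multiword, skipped
    ("Patient transport", "hlca"),                 # multiword, skipped
]

st_documents = build_st_documents(concept_strings)
print(st_documents)
```

Each resulting word set can then be JD-indexed like any other text, by taking the centroid of the JD vectors of its words.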

Using the standard vector cosine coefficient (Salton & McGill, [1983]), we then computed the similarity, on a scale of 0-1, between the JD vector for each word in the training set and the JD vector for each ST document. Each word and its scores indicating similarity to ST documents (in terms of JD indexing), ordered alphabetically by ST abbreviation, became an entry in the word-ST table (i.e., an ST vector) used for ST indexing, as described earlier.
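The similarity step can be sketched with the standard cosine coefficient. The function names `cosine` and `st_vector` are introduced here; the topp entries are the three highest JD scores from Table 13, while the JD vector for the word and the orgf document vector are hypothetical values chosen only to make the example run. Vectors are stored sparsely, with missing JDs treated as 0.

```python
import math
from typing import Dict


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine coefficient between two sparse vectors (missing keys count as 0)."""
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in set(u) | set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def st_vector(word_jd: Dict[str, float],
              st_doc_jd: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Word-ST table entry: similarity of a word's JD vector to each ST document,
    ordered alphabetically by ST abbreviation."""
    return {st: cosine(word_jd, st_doc_jd[st]) for st in sorted(st_doc_jd)}


# JD vectors of ST documents: topp values from Table 13, orgf values hypothetical.
st_doc_jd = {
    "topp": {"Otolaryngology": 0.4827, "Urology": 0.4803, "Surgery": 0.4740},
    "orgf": {"Physiology": 0.5, "Reproduction": 0.3},
}

# Hypothetical JD vector for the word "sphincter".
sphincter_jd = {"Urology": 0.6, "Gastroenterology": 0.4}

print(st_vector(sphincter_jd, st_doc_jd))
```

Because JD scores are nonnegative, the cosine always falls between 0 and 1, matching the scale mentioned in the text.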

Looking again at Tables 3, 4, and 5, we now can interpret the items in these ST vectors in terms of similarity to ST documents. That is, JD indexing of implantation is more similar to JD indexing of the topp document than of the orgf document; JD indexing of blastocyst is more similar to JD indexing of the orgf document than of the topp document; JD indexing of sphincter is more similar to JD indexing of the topp document than of the orgf document. Thus, ST indexing selects topp when the ambiguous string implantation occurs in a context (e.g., s2) containing words with JD indexing more similar to that of the topp document; conversely, ST indexing selects orgf when implantation occurs in a context (e.g., s1) containing words with JD indexing more similar to that of the orgf document.

Related Work

Word sense disambiguation is a difficult but crucial task in many areas of automatic language processing, such as information retrieval (Clough & Stevenson, [2004]; Voorhees, [1998]), machine translation (Brown, Della Pietra, Della Pietra, & Mercer, [1991]), and question answering (Pasca & Harabagiu, [2001]). Since the late 1950s, numerous solutions to the ambiguity problem have been explored. The growing interest in disambiguation methods and their performance led to formation of SENSEVAL, an international organization devoted to evaluation of word sense disambiguation systems (Edmonds & Kilgarriff, [2002]; Kilgarriff & Rosenzweig, [2000]; Mihalcea, Chklovsky, & Kilgarriff, [2004]). For a review of existing disambiguation methods, which is beyond the scope of this article, see Ide and Véronis ([1998]). In the following we present work related to JDI because of either the similarity in the approach or the common domain and collection used in the experiments.

The JDI method described in this article combines a statistical, corpus-based method (2-year MEDLINE training set) with utilization of preexisting medical domain knowledge sources, JDs (National Library of Medicine, [2002]) and STs (National Library of Medicine, [2004c]).

Statistical methods are based on the idea that the given context determines the sense of the word. These methods rely on learning disambiguation rules from large sense-tagged corpora. Further distinction in the learning methods is based on the manner in which the text collection is annotated with word senses. Supervised methods that show the best performance in many natural language processing tasks rely on extensive high-quality manual sense tagging of large amounts of text. This dependence restricts application of supervised methods to tasks and domains for which resources exist. Bootstrapping the annotation process with a smaller amount of hand-tagged data or resorting to fully automatic unsupervised methods has been suggested as a way to overcome the data acquisition problem (Yarowsky, [1995]). Approaches that attempt to obtain annotated data but avoid manual annotation have been explored recently. These methods include creating a collection by formulating a query using WordNet definitions of word senses and searching the Web (Mihalcea & Moldovan, [1999]), eliciting volunteer contributions using a Web-based application (Mihalcea, Chklovsky, & Kilgarriff, [2004]), and employing text in parallel translations (Resnik, [2004]).

In the spirit of avoiding costly manual annotation, the JDI method assigns JDs and subsequently STs to the text in the training set, thus avoiding the need to discover word senses in untagged text as in clustering-based unsupervised approaches (Pantel & Lin, [2002]; Pedersen & Bruce, [1997]; Schütze, [1992]). Because JD assignment and the subsequent steps are performed automatically, JDI is a rather sophisticated unsupervised approach that creates a representation of word senses (word-ST vectors) by using cooccurrences of words with JDs (word-JD vectors) from the training set with the help of ST assignments to concepts in the UMLS Metathesaurus. Thus, the WSD collection is not used for training.

Using the UMLS and JDs as the source of knowledge is conceptually close to using domain-independent methods that employ preexisting knowledge repositories, such as machine-readable dictionaries or thesauri, for the same purpose. Dictionary-based methods, pioneered by Lesk ([1986]), compare the dictionary definitions of the word senses with the words in the context. These methods differ in the types of source used and the ways in which similarity between the sense representation and the word context is measured and in general do not have the benefit of the sense assigned to the training set provided by JDs. Yarowsky ([1992]) developed a statistical model based on categories of Roget's International Thesaurus and text of the Grolier Encyclopedia. Liddy and Paik ([1993]) and Liddy, Paik, and Woelfel ([1993]) use Subject Field Codes (SFCs) from Longman's Dictionary of Contemporary English (LDOCE); however, the codes are manually assigned to each word in the dictionary by lexicographers rather than being propagated, as in the JDI approach.

Domain Driven Disambiguation (Magnini, Strapparava, Pezzulo, & Gliozzo, [2002]) augments WordNet (Fellbaum, [1998]) with domain labels from the Dewey Decimal Classification to represent the context and the word senses by using domain vectors. Interestingly, the kernel-based system that incorporates this method was one of the best performing systems in the SENSEVAL-3 English lexical sample WSD task (Strapparava, Giuliano, & Gliozzo, [2004]). This task, which requires annotation of instances of sample words in short extracts of text, is equivalent to the goal of the JDI method in disambiguating MetaMap output. It may be of interest to note that the average precision of JDI, ranging from 77.10% to 78.73% depending on context (Table 14, as discussed in the Results and Analysis section), is comparable to the precision of the top-performing supervised system participating in this SENSEVAL-3 task, which is 79.3% (Mihalcea, Chklovsky, & Kilgarriff, [2004]).

 
Table 14. Summary and individual precision scores comparing MeSH Frequency disambiguation and JDI (Journal Descriptor Indexing) disambiguation for four contexts studied (doc, ambig-sentence, ambig-sentences, and doc-rule, described in Table 15).

Precision by method and context; the last column gives the number of instances.

Ambiguities        MeSH Frequency   JDI doc          JDI ambig-sentence  JDI ambig-sentences  JDI doc-rule     Instances

Summary
average            0.2492           0.7710           0.7860              0.7873               0.7870           54
median             0.0152           0.8507           0.8939              0.9048               0.9048           63
range              0.0000 - 1.0000  0.0448 - 1.0000  0.0448 - 1.0000     0.0448 - 1.0000      0.0597 - 1.0000  3 - 67

Individual
adjustment         0.1000           0.8167           0.6333              0.7500               0.7667           60
Blood_pressure     0.0000           0.4030           0.4478              0.4179               0.4179           67
condition          0.0169           0.8983           0.9322              0.9322               0.9322           59
culture            0.1045           1.0000           0.9552              0.9851               1.0000           67
degree             0.0000           0.9318           0.9545              0.9545               0.9773           44
depression         1.0000           0.8070           0.9474              0.9474               0.9474           57
determination      0.0000           1.0000           1.0000              1.0000               1.0000           54
discharge          1.0000           0.8889           0.9630              0.9630               0.9259           54
energy             0.0000           0.6418           0.8358              0.7313               0.7015           67
evaluation         0.0000           0.5522           0.5672              0.5821               0.5970           67
extraction         0.0000           1.0000           0.9831              0.9831               0.9831           59
failure            0.0000           1.0000           0.9444              0.9444               0.9444           18
fat                0.9583           0.6250           0.7917              0.7500               0.7500           48
fit                0.0000           1.0000           1.0000              1.0000               1.0000           12
fluid              0.0000           0.0448           0.0448              0.0448               0.0597           67
frequency          0.0000           0.8889           0.9683              0.9048               0.9048           63
ganglion           0.9403           0.9403           0.9403              0.9403               0.9403           67
glucose            0.9254           0.4179           0.3582              0.3881               0.3881           67
growth             0.0000           0.7463           0.6567              0.7015               0.7015           67
immunosuppression  0.5224           0.6866           0.6866              0.7612               0.7463           67
implantation       0.1667           0.8939           0.8939              0.9242               0.9394           66
inhibition         0.0000           0.9851           0.9254              1.0000               0.9851           67
japanese           0.0000           0.4717           0.5849              0.5660               0.5472           53
lead               0.8889           0.2778           0.3889              0.3889               0.3889           18
mole               0.0182           1.0000           0.9818              0.9818               0.9818           55
mosaic             0.0000           0.6923           0.6769              0.6769               0.6769           65
nutrition          0.1774           0.4032           0.3871              0.3871               0.3548           62
pathology          0.1493           0.7164           0.7463              0.7463               0.7463           67
Pressure           1.0000           0.1364           0.1061              0.1212               0.1212           66
radiation          0.4242           0.8030           0.7576              0.8030               0.7879           66
reduction          0.0000           1.0000           1.0000              1.0000               1.0000           10
repair             0.2727           0.9318           0.8636              0.8636               0.8636           44
resistance         0.0000           1.0000           1.0000              1.0000               1.0000           3
scale              0.0000           0.5116           0.7209              0.6279               0.6047           43
secretion          0.0149           0.9104           0.9403              0.9403               0.9403           67
sensitivity        0.0000           0.8286           0.8857              0.8286               0.8286           35
single             0.0000           0.9701           0.9851              0.9851               1.0000           67
strains            0.0000           0.9516           0.9677              0.9839               0.9839           62
support            0.0000           1.0000           1.0000              1.0000               1.0000           7
surgery            0.0149           0.8507           0.9851              0.9254               0.9254           67
transient          0.0000           1.0000           1.0000              0.9851               0.9851           67
transport          0.9844           1.0000           0.9531              0.9688               0.9844           64
ultrasound         0.8209           0.8060           0.8507              0.8060               0.8060           67
variation          0.1791           0.7164           0.6567              0.7015               0.7313           67
white              0.5333           0.5500           0.5000              0.5333               0.5500           60

 
Table 15. Contexts for ambiguous instances.

Context nameDescription

ambig-sentenceThe one sentence containing the ambiguous string reviewed by raters (which we call the target sentence)
docThe entire citation
ambig-sentencesAll sentences containing the ambiguous string or its variants
doc-ruleIf ambig-sentence = ambig-sentences and ambig-sentence has fewer words than some threshold, then use doc

Maynard and Ananiadou ([2000]) use the UMLS and Semantic Network and the strength of association between a multiword term and its context to identify one sense for that term in the corpus. Here again JDI of the training set permits finer granularity of the sense assignment: i.e., the word can be disambiguated given a paragraph or a sentence.

The idea of disambiguating terms in the biomedical context by using the UMLS semantic types of unambiguous neighboring concepts was introduced by Aronson, Rindflesch, and Browne ([1994]). The availability of an extensive knowledge source such as UMLS has potential to reduce significantly or even eliminate the need for manual sense annotation. One such unsupervised approach was studied by Widdows and colleagues ([2003]), who augmented information about concepts and semantic types with information about cooccurring concepts also contained in UMLS. In this approach, first all possible senses are found for each ambiguous word. Then all conceptually related and coindexing terms for each sense are extracted from the corresponding sources (conceptually related terms can be found in the UMLS MRREL and MRCXT files, and the UMLS MRCOC file contains the coindexing terms). Then the local context of the ambiguous word is examined for the presence of the related concepts. The sense that is supported by the largest number of related terms in the context is assigned to the ambiguous word. This study found both precision and recall to be better when only coindexing terms were used for disambiguation as opposed to the combination of the coindexing and hierarchically related terms. In another unsupervised approach Liu, Johnson, and Friedman ([2002b]) used the MRREL file to annotate related concepts in MEDLINE citations automatically. The presence of conceptual relatives permitted determination of the sense of the ambiguous word in a large number of citations. The remaining citations were disambiguated by using a naive Bayes classifier trained on the previously disambiguated texts.

Because both unsupervised methods described rely on the presence of related concepts in the citation, they might be sensitive to the exact wording of the text in the same manner that the early methods that used machine-readable dictionaries as the knowledge source were sensitive to the wording of the sense definitions. The advantage of the JDI method is that it does not require the presence of specific words in the text that contain the ambiguity (i.e., all words are prelabeled with JDs inherited by the training set documents from the journals they appear in, and then labeled with STs according to the methodology explained in the previous section), and thus it is not necessary to have large numbers of examples with these specific words.

Although our method is not supervised, two experiments that used parts of the NLM's WSD collection for supervised word sense disambiguation should be mentioned. Liu, Teller, and Friedman ([2004]) studied various sizes of immediate contexts to the right and to the left of the ambiguous word for training of machine learning algorithms that demonstrated high accuracy in general English word sense disambiguation, namely, naive Bayes, decision list, and a combination of a naive Bayes and an instance-based classifier. Because none of the classifiers in this experiment outperformed the rest for all ambiguities, the authors recommend selecting the best classifier individually for each term, and using supervised WSD only when there are at least a few dozen instances tagged for each sense of the word. Leroy and Rindflesch ([2004]) studied the possibility of reducing the size of the required training set by utilizing symbolic knowledge encoded in the UMLS. In this experiment a naive Bayes classifier was trained on sentences containing ambiguous words that were represented by using a combination of syntactic features, semantic types found in the sentence, and semantic network relations, such as part-of, between these semantic types. We compare the performance of JDI to these methods in the Results and Analysis section.

Experimental Method

A Word Sense Disambiguator interface has been developed to determine the performance of individual disambiguation methods on the WSD Test Collection (Figure 3). This interface was used for running the baseline MeSH Frequency method (described later) and the JDI method to be compared to it. We have used Disambiguator in an experiment to measure the performance of MeSH Frequency and four versions of JDI corresponding to different contexts in which the ambiguity occurs, as described later in this section.

Figure 3. Word Sense Disambiguator interface where the indexing method (e.g., MeSH Frequency Method) and ambiguities, e.g., implantation, are selected.

MeSH Frequency uses frequency counts of MeSH indexing terms in a subset of MEDLINE citations. (MeSH Frequency forms the baseline for developing JDI but is not used in an implemented system.) Each candidate concept for an ambiguity is matched to a MeSH synonym, if there is one. The concept that has the MeSH synonym with the highest frequency count in MEDLINE is returned as the Disambiguator answer. Figure 4 shows the first few lines of the results for MeSH Frequency in disambiguating the instances of the implantation ambiguity discussed in previous sections of this article. (Only 67 instances are processed as a training set for disambiguation methods; the remaining 33 are reserved as a test set.) In a line of results, the Item ID identifies the ambiguous text. For example, in the last line of Figure 4, 9344537.ab.1 stands for the first sentence in the abstract in the citation with PMID 9344537. Next on the line is the reviewed answer from the consensus of human raters, followed by the Disambiguator answer for the particular method that was selected, in this case MeSH Frequency. Clicking on this Item ID displays the citation with the sentence containing the ambiguity highlighted (Figure 5). This display is similar to the one shown to human raters in developing the WSD Test Collection. The ambiguity is also highlighted in other sentences, although raters focused on the highlighted sentence for the disambiguation. This display is informative in evaluation of automatic indexing methodologies because it allows the context of the ambiguity to be viewed. The ambiguous text in Figure 5 is our sample s1 sentence.

Figure 4. Word Sense Disambiguator display for MeSH Frequency results for implantation ambiguity, where Blastocyst Implantation, natural is the Disambiguator answer for all 67 instances.

Figure 5. Word Sense Disambiguator display for MeSH Frequency results for particular implantation ambiguity item corresponding to s1.

Referring to Figure 4, for implantation, the MeSH Frequency method selects Blastocyst Implantation, natural as the correct concept for all 67 instances. This is the reviewed answer for only 11 instances, which is reflected in the True Positive (TP) number in the Overall Summary line. Precision in this line is 0.1642, which is TP/Count (11 of a total count of 67). The reason for this poor performance is that this concept has a MeSH synonym (Ovum Implantation), but the other concept, Implantation procedure, has no MeSH synonym. The Overall Summary also gives counts and scores, ignoring the instances in which None of the Above is the reviewed answer. For this ambiguity, there was only one None of the Above; therefore, ignoring this instance, Count = 66, and Precision = 11 ÷ 66 = 0.1667. We are using scores that ignore None of the Above because neither MeSH Frequency nor the JDI method is designed to return this answer (see discussion of this point at the end of this section).
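The two precision figures for implantation can be reproduced with a few lines of arithmetic. The sketch below is only illustrative; the function and variable names are ours and are not part of the Disambiguator interface.

```python
def precision(true_positives: int, count: int) -> float:
    """Precision as TP / Count, rounded to four decimals as in Table 14."""
    return round(true_positives / count, 4)

# Figures reported for the implantation ambiguity under MeSH Frequency.
tp = 11             # instances where the Disambiguator answer matched the reviewed answer
total = 67          # all instances processed
none_of_above = 1   # instances whose reviewed answer was None of the Above

precision_all = precision(tp, total)                       # all 67 instances
precision_ignoring = precision(tp, total - none_of_above)  # ignoring None of the Above

print(precision_all, precision_ignoring)
```

The second value is the one carried into Table 14, because scores there ignore None of the Above instances.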

As shown in Table 14, the average score for MeSH Frequency is 0.2492, which is the average of the precision scores for the 45 ambiguities processed by this method in the experiment (see discussion on elimination of five ambiguities at the end of this section). Practically half the ambiguities have a precision score of 0.0000 (the Disambiguator answer is No match found for all instances) because of the absence of MeSH synonyms for all candidate concepts. In cases in which performance is good for this method, the concept that has the MeSH synonym with the highest frequency happens to be correct for most instances.
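The baseline decision rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Disambiguator code: the function name, the data structures, and the frequency count are hypothetical, and only the concept names and the MeSH synonym Ovum Implantation come from the text.

```python
from typing import Optional

def mesh_frequency_answer(candidates: dict[str, Optional[str]],
                          mesh_counts: dict[str, int]) -> str:
    """Return the candidate concept whose MeSH synonym is most frequent.

    candidates maps each candidate concept to its MeSH synonym (None if it has none).
    mesh_counts maps MeSH terms to frequency counts in a citation subset.
    """
    scored = {concept: mesh_counts.get(synonym, 0)
              for concept, synonym in candidates.items()
              if synonym is not None}
    if not scored:
        # No candidate has a MeSH synonym, so nothing can be ranked.
        return "No match found"
    return max(scored, key=scored.get)

implantation = {
    "Blastocyst Implantation, natural": "Ovum Implantation",
    "Implantation procedure": None,  # no MeSH synonym
}
counts = {"Ovum Implantation": 1000}  # made-up count, for illustration only

print(mesh_frequency_answer(implantation, counts))
```

Because only one candidate has a MeSH synonym, the same answer is returned for every instance regardless of context, which is the behavior reported for implantation.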

A particular methodologic issue that arises for the JDI method is what the context for an ambiguous instance should be. Should it be just the sentence in which the ambiguous string appears (i.e., target sentence)? Should it be the entire citation? An alternative context for the citation is the target sentence together with other sentences containing the ambiguity or a morphological variant of the ambiguity. Variants were determined by using the UMLS SPECIALIST Lexicon; for example, variants of the ambiguous string culture are cultures, cultured, culturing, cultural. A question arose in the situation in which the desired context is all sentences with the ambiguity/variants, but there is only one sentence that qualifies, i.e., the one with the ambiguity. Is some additional context always desirable beyond this sentence? We therefore derived a rule that if this sentence has fewer unique words than some threshold, the system goes to the entire citation as context. Table 15 summarizes the contexts in our preliminary experiments.
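The four contexts can be sketched in code as follows. This is a hedged illustration: the function name, the simple tokenizer, and the default threshold of 10 unique words are assumptions (the threshold value is not stated here), and variants are passed in directly rather than looked up in the SPECIALIST Lexicon.

```python
import re

def build_context(sentences, target_index, variants, mode, threshold=10):
    """Return the list of sentences used as context for one ambiguous instance.

    sentences: sentences of the citation, in order.
    target_index: index of the target sentence reviewed by raters.
    variants: lowercase ambiguous string plus its morphological variants.
    mode: "doc", "ambig-sentence", "ambig-sentences", or "doc-rule".
    threshold: hypothetical minimum number of unique words (assumption).
    """
    def words(sentence):
        return re.findall(r"[a-z0-9_]+", sentence.lower())

    target = [sentences[target_index]]
    with_ambiguity = [s for s in sentences if variants & set(words(s))]

    if mode == "doc":
        return list(sentences)
    if mode == "ambig-sentence":
        return target
    if mode == "ambig-sentences":
        return with_ambiguity
    if mode == "doc-rule":
        # Fall back to the whole citation only when the target sentence is the
        # sole sentence with the ambiguity and it has too few unique words.
        if with_ambiguity == target and len(set(words(target[0]))) < threshold:
            return list(sentences)
        return with_ambiguity
    raise ValueError(f"unknown context mode: {mode}")

citation = ["Cells were cultured overnight.", "The culture grew.", "Results were clear."]
culture_variants = {"culture", "cultures", "cultured", "culturing", "cultural"}
print(build_context(citation, 1, culture_variants, "doc-rule"))
```

In this toy citation two sentences contain the ambiguity or a variant, so doc-rule behaves like ambig-sentences; with a single short qualifying sentence it would return the whole citation instead.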

Results of JDI using the various contexts for the 45 remaining ambiguities will be presented in the Results and Analysis section for comparison with one another and with MeSH Frequency.

Five of the ambiguities were eliminated for this experiment: association, cold, man, sex, and weight. The last four of these are each mapped to two concepts that have the same ST. For example, weight is mapped to the concepts Body Weight and Weight, both of which are assigned the ST qnco (in addition, Body Weight is mapped to orga); for the more than 40 instances in which JDI found qnco to be the better ST (than orga), the system had no way of knowing which of the two concepts to select because they were both assigned this same ST.
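The shared-ST problem can be illustrated with a small sketch of the final selection step. The ST scores below are made up, and the function is a simplification of our own: it assumes that a score per candidate ST has already been computed for the context, which is the part of the JDI methodology not reproduced here.

```python
def select_concept(concept_sts, st_scores):
    """Pick the concept whose ST scores highest for the context.

    concept_sts: candidate concept -> set of ST abbreviations assigned to it.
    st_scores: ST abbreviation -> score for the context (hypothetical values).
    Returns None when the best-scoring ST is shared by several concepts.
    """
    candidate_sts = set().union(*concept_sts.values())
    best_st = max(candidate_sts, key=lambda st: st_scores.get(st, 0.0))
    matches = [c for c, sts in concept_sts.items() if best_st in sts]
    return matches[0] if len(matches) == 1 else None

weight = {"Body Weight": {"orga", "qnco"}, "Weight": {"qnco"}}
transport = {"Biological Transport": {"celf"}, "Patient transport": {"hlca"}}

print(select_concept(weight, {"qnco": 0.6, "orga": 0.4}))     # qnco wins, but two concepts carry it
print(select_concept(transport, {"celf": 0.7, "hlca": 0.2}))  # celf wins and is unique
```

When qnco is the best ST for weight, both candidate concepts match and no single answer can be returned, which is why such ambiguities were left out of the experiment.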

A more pervasive problem occurred when None of the Above was the reviewed answer. The JDI method must decide as to the best ST (unless, as rarely happens, the context is empty), hence the best Disambiguator answer. Thus, when the reviewed answer for either MeSH Frequency or JDI was None of the Above, the Disambiguator answer was always incorrect. Because neither method was designed to return None of the Above, the researchers decided to present and therefore concentrate on results that ignore those instances with this reviewed answer. Because all reviewed answers for the ambiguity association were None of the Above, this ambiguity was eliminated altogether. A side effect of ignoring None of the Above was to reduce the total number of instances by more than half for the ambiguities failure, fit, lead, reduction, resistance, and support, but these were included in the results anyway. One can assume that raters selected None of the Above for many instances of these six ambiguities because the ambiguities are common English words that correspond to concepts not found in the Metathesaurus.

Results and Analysis

We ran the ambiguities comparing MeSH Frequency and the various JDI contexts. Summary precision scores and individual precision scores for the 45 ambiguities are presented in Table 14. JDI, regardless of context, performed substantially better than MeSH Frequency: average precision was 0.2492 for MeSH Frequency versus 0.7710 to 0.7873 for the JDI contexts. The median precision for MeSH Frequency was 0.0152 versus a median precision ranging from 0.8507 to 0.9048 for the JDI contexts. Of the 45 ambiguities, 22 had a precision score of 0.0000 under MeSH Frequency (see discussion of MeSH Frequency in the previous section for explanation) versus none for JDI.

Three of the JDI contexts (ambig-sentence, ambig-sentences, and doc-rule) approached 79% average precision; the remaining context (doc) had an average precision of 77%. The context giving the best average precision score was ambig-sentences. The doc-rule context resulted in only a slightly lower score, a result that is not surprising because, in the instances in which there was more than one sentence containing the ambiguity, ambig-sentences was used under doc-rule as well. The ambig-sentence context scored slightly lower than doc-rule and ambig-sentences, suggesting that, on average, just the target sentence may be too little context compared to those contexts. Figure 6 is an example in which a target sentence containing the ambiguity implantation - No serious complication resulted from implantation of FOE in this series. - resulted in the incorrect answer Blastocyst Implantation, natural rather than Implantation procedure because the ST orgf had a higher score than topp for this sentence. In particular, the acronym FOE was not helpful, as in the training set it usually appears in the context of friend or foe and the word foe generates a higher score for orgf (which ranks 25th among the STs) than for topp (which ranks 52nd). The ambig-sentences context, which used all four sentences containing implantation, gave the correct answer, as did the doc context (all 14 sentences in the citation). On average, doc scored lowest, suggesting that the entire document may be too much context compared to the others.

Figure 6. Example of target sentence with too little context including the acronym FOE which contributes to the wrong answer.

The data were analyzed in terms of the number of ambiguities for which each context performed best (precision was best or tied for best), worst (precision was worst or tied for worst), or intermediate (Table 16). The contexts doc and ambig-sentence had the best precision for 21 and 22 ambiguities, respectively, and the worst precision for 22 and 18 ambiguities, respectively; these contexts usually performed either best or worst and were intermediate for only 2 and 5 ambiguities. The doc-rule context had the best performance for 20 ambiguities compared to 15 for ambig-sentences, and they were tied at 9 ambiguities for worst performance. Thus, in this analysis, it would seem that doc-rule had the edge in terms of optimal performance (balancing best and worst precision). Ignoring ambiguities in which the difference between best and worst performance was less than 0.0200 (extraction, mole, mosaic, and transient), the data suggest that doc, which was best for 17 ambiguities and worst for 22 ambiguities, fared poorest in terms of optimal performance, whereas doc-rule (best for 20 ambiguities and worst for 5) remained the best. Ranked second and third for optimal performance would be ambig-sentences and ambig-sentence, respectively.

 
Table 16. Comparison of JDI contexts in terms of number of ambiguities where precision was best, worst, and intermediate, suggesting optimum performance.

Context            Number of ambiguities
                   Best precision   Worst precision   Intermediate precision   Total

doc                21 (17a)         22 (22a)          2 (2a)                   45 (41a)
ambig-sentence     22 (21a)         18 (15a)          5 (5a)                   45 (41a)
ambig-sentences    15 (15a)         9 (5a)            21 (21a)                 45 (41a)
doc-rule           20 (20a)         9 (5a)            16 (16a)                 45 (41a)

a Ignoring ambiguities extraction, mole, mosaic, and transient, where the difference between worst and best precision was < 0.0200.
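The best/worst/intermediate tally can be sketched as follows, using four rows of Table 14 as sample input. The tie-handling shown (a context tied for best is counted as best even when all four contexts tie) is our reading of how the counts add up to 45 per context; it is an assumption, not a documented rule.

```python
from collections import Counter

CONTEXTS = ("doc", "ambig-sentence", "ambig-sentences", "doc-rule")

def rank_contexts(rows):
    """Tally, per context, how often its precision was best, worst, or intermediate.

    rows maps an ambiguity to its four JDI precision scores, ordered as CONTEXTS.
    """
    tally = {context: Counter() for context in CONTEXTS}
    for scores in rows.values():
        best, worst = max(scores), min(scores)
        for context, score in zip(CONTEXTS, scores):
            if score == best:          # best or tied for best takes priority
                label = "best"
            elif score == worst:       # worst or tied for worst
                label = "worst"
            else:
                label = "intermediate"
            tally[context][label] += 1
    return tally

# Four rows copied from Table 14 (JDI columns only).
sample = {
    "adjustment":    (0.8167, 0.6333, 0.7500, 0.7667),
    "culture":       (1.0000, 0.9552, 0.9851, 1.0000),
    "determination": (1.0000, 1.0000, 1.0000, 1.0000),
    "inhibition":    (0.9851, 0.9254, 1.0000, 0.9851),
}

for context, counts in rank_contexts(sample).items():
    print(context, dict(counts))
```

Running the same function over all 45 rows would be the natural way to check the counts in Table 16.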

We compare the optimally performing JDI method, doc-rule, to two supervised methods using the WSD collection. In general, precision of JDI is comparable to that of these other methods. Table 17 compares JDI to the best overall naive Bayes classifier in Leroy and Rindflesch ([2004]) for the 13 ambiguities classified by both methods. For 9 ambiguities, JDI precision is higher, and average JDI precision is higher. Although the Liu and associates ([2004]) experiment does not permit a side-by-side comparison, performance of all supervised classifiers (precision around 80%) on 22 of the original 50 ambiguities is comparable to that of the methods presented in Table 14.

 
Table 17. Comparison of best overall JDI disambiguation method and naive Bayes classifier method.

Ambiguities        JDI precision   Naive Bayes precision

adjustment         .7667           .57
blood_pressure     .4179           .46
degree             .9773           .68
evaluation         .5970           .57
growth             .7015           .62
immunosuppression  .7463           .63
mosaic             .6769           .66
nutrition          .3548           .48
radiation          .7879           .72
repair             .8636           .81
scale              .6047           .84
sensitivity        .8286           .70
white              .5500           .62
Average            .6826           .64
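The summary statements about Table 17 (JDI higher for 9 of 13 ambiguities, and the two averages) can be checked directly from the table values. The short script below is only a verification sketch; the variable names are ours.

```python
from statistics import mean

# (JDI doc-rule precision, naive Bayes precision) copied from Table 17.
table17 = {
    "adjustment": (0.7667, 0.57),
    "blood_pressure": (0.4179, 0.46),
    "degree": (0.9773, 0.68),
    "evaluation": (0.5970, 0.57),
    "growth": (0.7015, 0.62),
    "immunosuppression": (0.7463, 0.63),
    "mosaic": (0.6769, 0.66),
    "nutrition": (0.3548, 0.48),
    "radiation": (0.7879, 0.72),
    "repair": (0.8636, 0.81),
    "scale": (0.6047, 0.84),
    "sensitivity": (0.8286, 0.70),
    "white": (0.5500, 0.62),
}

jdi_higher = [name for name, (jdi, nb) in table17.items() if jdi > nb]
jdi_average = round(mean(jdi for jdi, _ in table17.values()), 4)
nb_average = round(mean(nb for _, nb in table17.values()), 2)

print(len(jdi_higher), jdi_average, nb_average)
```

The ambiguities where naive Bayes is higher are blood_pressure, nutrition, scale, and white.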

We have begun to analyze JDI performance failure (which we define as less than 0.6500) by examining individual ambiguities. The following are some observations (refer to Appendixes A, B, and C for choices of meaning and ST) regarding poor performance:

1. Difficulty in distinguishing between chemicals and laboratory procedures: Examples include lead and glucose. In fact, the text strings lead and glucose each result in lbpr as the preferred ST, compared to elii for the former and to bacs and carb for the latter. That is, these strings have a higher association with laboratory procedure terms than with substance terms. Furthermore, sentences containing these words tend to have cooccurring words denoting laboratory procedures, thus boosting the lbpr score.
2. Difficulty in distinguishing between physiologic functions and their measurement or determination or the functions in terms of findings, for example, blood_pressure, in which the text has a higher association with diap and lbtr than with orgf.
3. Idiosyncratic Metathesaurus meanings and ST assignments, for example, pressure, in which one of the meanings is the concept Baresthesia (pressure sensation, or the physiologic discrimination of various degrees of pressure on the surface). In the ambig-sentences context, 46 of the 58 incorrect answers involved Baresthesia as the incorrect answer.
4. System's nonselection of a very general ST over a very common ST, for example, fluid, in which the correct ST was sbst for every instance, in contrast to qlco, but it was selected by the system only 3 of 67 times for the ambig-sentences context.
5. Difficulty in distinguishing between STs for two types of general activity, for example, evaluation, which requires distinguishing between hlca (the most general health care activity ST) and resa (research activity ST).
6. Difficulty in distinguishing between STs that share semantic features, for example, nutrition, which may require selecting between semantically related STs orga and orgf as the correct ST, and japanese, which requires selecting between STs popg and lang.
7. Ambiguities in which the context often does not reflect the ST of the meaning of the ambiguity. For example, human raters selected the topp meaning for the following ambig-sentences context for nutrition (in which the ambiguity is the variant nutritional): If women have a different metabolic response to the human immunodeficiency virus (HIV), nutritional advice may differ from HIV-seropositive men. Therefore, nutritional advice may need to vary according to the gender of the asymptomatic HIV-seropositive subject. The system's selection for the context was orga because this was the best ST for many of the words (e.g., immunodeficiency, seropositive, HIV, virus).

For some of these poor-performance ambiguities it is also the case that the contexts corresponding to the meanings can be expected to be similar (i.e., have similar vocabularies) to one another. On the other hand, for several ambiguities in which system performance was good (which we define as greater than 0.8500) the contexts corresponding to different meanings can be expected to be quite different. This difference, in turn, can be translated into contrasting STs that correspond to the words in the contexts to which JDI is sensitive. Examples of good performance include ambiguities involving the following:

1. Natural or physiologic processes versus intentional procedures: reduction (npop hlca), transport (celf hlca), implantation (orgf topp)
2. Laboratory versus nonlaboratory environment: determination (gora lbpr), culture (idcn lbpr), extraction (topp lbpr)
3. Temporality versus nontemporality: transient (popg tmco), frequency (tmco sosy)
4. Mental versus nonmental: inhibition (menp moft), resistance (menp socb), depression (ftcn mobd), condition (qlco menp)
5. Social versus nonsocial: support (socb medd), failure (patf socb)

Future Work

Future work falls into two categories: improving the JDI methodology and studying the use of JDI in applications.

Improving the JDI methodology (see Methodology of JDI-Based ST Indexing) includes updating the ST documents on the basis of the latest version (2004) of the UMLS Metathesaurus. The ST documents we are using were developed in 2002. Another aspect of the methodology we will examine is the stopwords and restrictwords lists. An extensive stopword list, developed empirically, is now being used. Using JDI, we may be able to identify what constitutes a good stopword by comparing the JD vectors of generally agreed upon stopwords with those of candidate stopwords. Improving the methodology also includes extending it to address the None of the Above problem. For example, if the candidate STs all score very low, is this an indication that none of them is appropriate? We also can try to adopt methods for identifying acronyms (Liu, Aronson, & Friedman, [2002a]; Schwartz & Hearst, [2003]; Wren & Garner, [2002]; Yu, Hripcsak, & Friedman, [2002]), substituting the full form for the acronym. For example, if the full form foramen ovale electrode had been substituted for FOE in the target sentence shown in Figure 6, the correct ST would have resulted. We can test changes on the WSD test collection.

Disambiguation by means of JDI is already being used in experimental systems at NLM, specifically in SemGen (adapted from the natural language processing [NLP] program SemRep), which identifies gene interaction predications from MEDLINE citations (Libbus, Kilicoglu, Rindflesch, Mork, & Aronson, [2004]; Rindflesch, Libbus, Hristovski, Aronson, & Kilicoglu, [2003]). JDI increases accuracy by identifying citations in the molecular genetics domain before NLP begins. JDI has also been explored for gene symbol disambiguation in connection with BITOLA, an interactive literature-based biomedical discovery support system (Hristovski, Peterlin, Mitchell, & Humphrey, [2005]). For example, JDI can determine that the document title Ethics in a twist: Life Support, BBC1 is outside the genetics domain, thereby, in effect, disambiguating the British television station BBC1, as in this title, from the symbol BBC1 for the breast basic conserved 1 gene. On the basis of the experiment described in the current article, perhaps JDI can be studied further in applications necessitating WSD of strings according to various meanings associated with STs.

Conclusions

We have described an experiment using NLM's WSD test collection to compare four versions of the Journal Descriptor Indexing methodology (based on extent of context) to a baseline MeSH Frequency methodology. For the 45 ambiguities studied, the overall average precision of the highest-scoring JDI method was 0.7873 compared to 0.2492 for MeSH Frequency. Furthermore, for the 45 individual ambiguities, average precision was greater than 0.90 for 23 (51%) of them, greater than 0.85 for 24 (53%), and greater than 0.65 for 35 (79%). On the basis of these results we believe that JDI shows promise as an unsupervised method for WSD using ready-made resources at NLM: JDs assigned to journals and thus automatically assigned to words in a large MEDLINE training set, and UMLS Metathesaurus concepts assigned to STs and thus serving as ST documents (sets of words labeled by the STs). JDI uses these resources to automatically prelabel words in the training set with JDs and then with STs. Our method obviates the need to hand tag a training set for word senses as in supervised methods. We hope to improve the performance of JDI and test its use in actual applications.

Acknowledgment

This research was supported by the Intramural Research Program of the National Institutes of Health, National Library of Medicine.

 
Table A1. Word Sense Disambiguation Test Collection ambiguities and respective semantic types and Metathesaurus concepts.

adjustment         ftcn Adjustment Action; inbe Individual Adjustment; menp Psychological Adjustment
association        menp Mental association; socb Relationship by association
blood_pressure     lbtr Arterial pressure; orgf Blood Pressure <1>; diap Blood Pressure Determination
cold               dsyn Common Cold Chronic Obstructive Airway Disease; qlco Cold Sensation; sosy Cold Sensation; topp Cold Therapy; npop cold temperature
condition          qlco Condition; menp Conditioning (Psychology)
culture            idcn Anthropological Culture; lbpr Laboratory culture
degree             qlco degree <1>; inpr degree <2>
depression         ftcn Depression motion; mobd Mental Depression
determination      gora adjudication; lbpr determination <2>
discharge          bdsu Discharge, Body Substance; hlca Patient Discharge
energy             npop Energy (physics); fndg Vitality
evaluation         inpr Evaluation; resa Evaluation; hlca Health evaluation
extraction         topp Extraction, NOS; lbpr extraction <1>
failure            patf Failure, NOS; socb failure <1>
fat                lipd Fatty acid glycerol esters; orga Obese build
fit                fndg Fit and well; dsyn Seizures; sosy Seizures
fluid              qlco Fluid <2>; sbst Liquid substance, NOS
frequency          tmco Frequencies; sosy Increased frequency of micturition
ganglion           acab Benign cystic mucinous tumour; bpoc Ganglia
glucose            bacs Glucose; carb Glucose; lbpr Glucose measurement
growth             orgf Growth <1>; ftcn growth <2>
immunosuppression  orgf Natural immunosuppression; topp Therapeutic immunosuppression
implantation       orgf Blastocyst Implantation, natural; topp Implantation procedure
inhibition         menp Psychological inhibition; moft inhibition, physical
japanese           popg Japanese Population; lang Japanese language
lead               elii Lead; lbpr Lead measurement, quantitative
man                humn Homo sapiens; popg Men Homo sapiens; orga Male
mole               neop Benign melanocytic nevus of skin; mamm Mole the mammal; qnco mol
mosaic             inpr Mosaic <4>; orga Mosaicism <1>; spco Spatial Mosaic
nutrition          topp Feeding and dietary regimes; orga Nutrition; bmod Science of nutrition; orgf Science of nutrition
pathology          bmod Pathology); patf pathology <3>
pressure           ortf Baresthesia; topp Pressure - action; qnco Pressure- physical agent
radiation          npop Electromagnetic Energy; topp Radiation therapy
reduction          npop Reduction (chemical); hlca Reduction - action
repair             topp Repair - action; orgf Wound Healing
resistance         menp Resistance <2>; socb resistance <1>
scale              bpoc Integumentary scale; inpr Intellectual scale; mnob Weight measurement scales
secretion          bdsu Bodily secretions; biof secretion <3>
sensitivity        lbtr Antimicrobial susceptibility; fndg Personality sensitivity; menp Personality sensitivity; qnco Statistical sensitivity
sex                inbe Coitus; orgf Coitus; orga Gender Sex <2>
single             qnco Singular; popg Unmarried <2>
strains            inpr Microbiology subtype strains; inpo Muscle strain
support            socb Support; medd Support, NOS
surgery            topp Surgery <3>; bmod Surgery specialty
transient          popg Transient Population Group; tmco Transitory
transport          celf Biological Transport; hlca Patient Transport
ultrasound         npop Ultrasonic Shockwave; diap Ultrasonography
variation          qlco Variant; npop Variation (Genetics)
weight             orga Body Weight; qnco Body Weight Weight
white              popg Caucasoid Race; qlco White color

 
Table B1. Semantic Type abbreviations and corresponding full forms represented in the Word Sense Disambiguation Test Collection.

acab Acquired Abnormality

bacs Biologically Active Substance

bdsu Body Substance

biof Biologic Function

bmod Biomedical Occupation or Discipline

bpoc Body Part, Organ, or Organ Component

carb Carbohydrate

celf Cell Function

diap Diagnostic Procedure

dsyn Disease or Syndrome

fndg Finding

ftcn Functional Concept

gora Government or Regulatory Activity

hlca Health Care Activity

humn Human

idcn Idea or Concept

inbe Individual Behavior

inpr Intellectual Product

lang Language

lbpr Laboratory Procedure

lbtr Laboratory or Test Result

lipd Lipid

mamm Mammal

medd Medical Device

menp Mental Process

mnob Manufactured Object

mobd Mental or Behavioral Dysfunction

moft Molecular Function

neop Neoplastic Process

npop Natural Phenomenon or Process

orga Organism Attribute

orgf Organism Function

ortf Organ or Tissue Function

patf Pathologic Function

popg Population Group

qlco Qualitative Concept

qnco Quantitative Concept

resa Research Activity

sbst Substance

socb Social Behavior

sosy Sign or Symptom

spco Spatial Concept

tmco Temporal Concept

topp Therapeutic or Preventive Procedure

Figure C1. Hierarchical view of Semantic Types abbreviations and corresponding full forms represented in the Word Sense Disambiguation Test Collection (shown in boldface among more general STs not represented in the collection).

References
Aronson, A.R. (2001). Effective mapping of biomedical text to the UMLS Metathesaurus: The MetaMap program. In Proceedings of the American Medical Informatics Association Annual Symposium (AMIA 2001) (pp. 17-21). Philadelphia: Hanley & Belfus.
Aronson, A.R., Bodenreider, O., Chang, H.F., Humphrey, S.M., Mork, J.G., Nelson, S.J., et al. (2000). The NLM Indexing Initiative. In Proceedings of the American Medical Informatics Association Annual Symposium (AMIA 2000) (pp. 17-21). Philadelphia: Hanley & Belfus.
Aronson, A.R., Mork, J.G., Gay, C.W., Humphrey, S.M., & Rogers, W.J. (2004). The NLM Indexing Initiative's Medical Text Indexer. Medinfo, 11(Pt 1), 368-372.
Aronson, A.R., Rindflesch, T.C., & Browne, A.C. (1994). Exploiting a large thesaurus for information retrieval. In Proceedings RIAO-94 Conference (pp. 197-216). Paris: CID.
Brown, P.F., Della Pietra, S.A., Della Pietra, V.J., & Mercer, R.L. (1991). Word-sense disambiguation using statistical methods. In Proceedings of the 29th Conference of the Association for Computational Linguistics (pp. 264-270). San Francisco: Morgan Kaufmann.
Clough, P., & Stevenson, M. (2004). Cross-language information retrieval using EuroWordNet and word sense disambiguation. In S. MacDonald & J. Tait (Eds.), Lecture Notes in Computer Science 2997: Advances in Information Retrieval, 26th Conference on IR Research, ECIR 2004 Proceedings (pp. 327-337). Heidelberg, Germany: Springer.
Edmonds, P., & Kilgarriff, A. (2002). Introduction to the special issue on evaluating word sense disambiguation systems. Natural Language Engineering, 8, 279-291.
Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.
Hristovski, D., Peterlin, B., Mitchell, J.A., & Humphrey, S.M. (2005). Using literature-based discovery to identify disease candidate genes. International Journal of Medical Informatics, 74, 289-298.
Humphrey, S.M. (1998). A new approach to automatic indexing using journal descriptors. In C.M. Preston (Ed.). Proceedings of the 61st ASIS Annual Meeting (pp. 496-500). Medford, NJ: Information Today.
Humphrey, S.M. (1999). Automatic indexing of documents from journal descriptors: A preliminary investigation. Journal of the American Society for Information Science, 50, 661-674.
Humphrey, S.M., Rindflesch, T.C., & Aronson, A.R. (2000). Automatic indexing by discipline and high-level categories: Methodology and potential applications. In Proceedings of the 11th ASIST SIG/CR Classification Research Workshop (pp. 103-116). Silver Spring, MD: American Society for Information Science and Technology.
Ide, N., & Véronis, J. (1998). Word sense disambiguation: The state of the art. Computational Linguistics, 24, 1-40.
Kilgarriff, A., & Rosenzweig, J. (2000). Framework and results for English SENSEVAL. Computers and the Humanities, 34, 15-48.
Leroy, G., & Rindflesch, T.C. (2004). Using symbolic knowledge in the UMLS to disambiguate words in small datasets with a naive Bayes classifier. Medinfo, 11(Pt 1), 381-385.
Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the Fifth Annual International Conference on Systems Documentation (SIGDOC) (pp. 24-26). New York: Association for Computing Machinery.
Libbus, B., Kilicoglu, H., Rindflesch, T.C., Mork, J.G., & Aronson, A.R. (2004). Using natural language processing, LocusLink, and the Gene Ontology to compare OMIM to MEDLINE. In HLT-NAACL 2004 Workshop: BioLINK 2004, Linking Biological Literature, Ontologies and Databases (pp. 69-76). East Stroudsburg, PA: Association for Computational Linguistics.
Liddy, E.D., & Paik, W. (1993). From handcrafted dictionary subject codes to statistically-guided word sense disambiguation. In Probabilistic Approaches to Natural Language, Papers from the AAAI Fall Symposium, Technical Report FS-92-05 (pp. 98-107). Menlo Park, CA: AAAI Press.
Liddy, E.D., Paik, W., & Woelfel, J.K. (1993). Use of subject field codes from a machine-readable dictionary for automatic classification of documents. In Advances in Classification Research, Proceedings of the Third ASIS SIG/CR Classification Workshop (pp. 83-100). Medford, NJ: Learned Information.
Liu, H., Aronson, A.R., & Friedman, C. (2002a). A study of abbreviations in MEDLINE abstracts. In Proceedings of the American Medical Informatics Association Annual Symposium (AMIA 2002) (pp. 464-468). Philadelphia: Hanley & Belfus.
Liu, H., Johnson, S.B., & Friedman, C. (2002b). Automatic resolution of ambiguous terms based on machine learning and conceptual relations in the UMLS. Journal of the American Medical Informatics Association, 9, 621-636.
Liu, H., Teller, V., & Friedman, C. (2004). A multi-aspect comparison study of supervised word sense disambiguation. Journal of the American Medical Informatics Association, 11, 320-331.
Magnini, B., Strapparava, C., Pezzulo, G., & Gliozzo, A. (2002). The role of domain knowledge in word sense disambiguation. Natural Language Engineering, 8, 359-373.
Maynard, D., & Ananiadou, S. (2000). Trucks: A model for automatic multiword term recognition. Journal of Natural Language Processing, 8, 101-126.
Mihalcea, R., & Moldovan, D.I. (1999). An automatic method for generating sense tagged corpora. In Proceedings of the 16th National Conference on Artificial Intelligence, 11th Conference on Innovative Applications of Artificial Intelligence (pp. 461-466). Menlo Park, CA: American Association for Artificial Intelligence.
Mihalcea, R., Chklovski, T., & Kilgarriff, A. (2004). The SENSEVAL-3 English lexical sample task. In Proceedings of SENSEVAL-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text [CD-ROM] (pp. 25-28). New Brunswick, NJ: Association for Computational Linguistics.
National Library of Medicine. (2002). List of journals indexed in Index Medicus 2002. NIH Publication No. 02-267. Bethesda, MD: National Library of Medicine.
National Library of Medicine. (2004a). Indexing initiative. Retrieved May 19, 2004, from http://ii.nlm.nih.gov/
National Library of Medicine. (2004b). Medical subject headings. Retrieved May 19, 2004, from http://www.nlm.nih.gov/mesh/200/MBrowser.html.
National Library of Medicine. (2004c). Unified medical language system. Retrieved May 19, 2004, from http://nlm.nih.gov/research/umls/
Pantel, P., & Lin, D. (2002). Discovering word senses from text. In Conference on Knowledge Discovery in Data, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 613-619). New York: ACM Press.
Pasca, M., & Harabagiu, S. (2001). The informative role of WordNet in open-domain question answering. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-01), Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations (pp. 138-143). East Stroudsburg, PA: Association for Computational Linguistics.
Pedersen, T., & Bruce, R. (1997). Distinguishing word senses in untagged text. In C. Cardie & R. Weischedel (Eds.), Proceedings of the Second Conference on Empirical Methods in Natural Language Processing (EMNLP-2) (pp. 197-207). Somerset, NJ: Association for Computational Linguistics.
Resnik, P. (2004). Exploiting hidden meanings: Using bilingual text for monolingual annotation. In A. Gelbukh (Ed.), Lecture Notes in Computer Science 2945: Computational Linguistics and Intelligent Text Processing: Fifth International Conference, CICLing 2004 Proceedings (pp. 283-299). Heidelberg, Germany: Springer.
Rindflesch, T.C., Libbus, B., Hristovski, D., Aronson, A.R., & Kilicoglu, H. (2003). Semantic relations asserting the etiology of genetic diseases. In Proceedings of the American Medical Informatics Association Annual Symposium [electronic resource] (AMIA 2003) (pp. 554-558). Bethesda, MD: American Medical Informatics Association.
Salton, G., & McGill, M.J. (1983). Introduction to modern information retrieval (p. 124). New York: McGraw-Hill.
Schütze, H. (1992). Dimensions of meaning. In Proceedings of the 1992 ACM/IEEE Conference on Supercomputing (pp. 787-796). Los Alamitos, CA: IEEE Computer Society Press.
Schwartz, A.S., & Hearst, M.A. (2003). A simple algorithm for identifying abbreviation definitions in biomedical text. In Pacific Symposium on Biocomputing (pp. 451-462). River Edge, NJ: World Scientific.
Strapparava, C., Giuliano, C., & Gliozzo, A. (2004). Pattern abstraction and term similarity for word sense disambiguation: IRST at SENSEVAL-3. In Proceedings of SENSEVAL-3: The Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text [CD-ROM] (pp. 229-234). New Brunswick, NJ: Association for Computational Linguistics.
Voorhees, E. (1998). Using WordNet for text retrieval. In C. Fellbaum (Ed.), WordNet: An electronic lexical database (pp. 285-303). Cambridge, MA: MIT Press.
Weeber, M., Mork, J.G., & Aronson, A.R. (2001). Developing a test collection for biomedical word sense disambiguation. In Proceedings of the American Medical Informatics Association Annual Symposium (AMIA 2001) (pp. 746-750). Philadelphia: Hanley & Belfus.
Widdows, D., Peters, S., Cederberg, S., Chan, C.-K., Steffen, D., & Buitelaar, P. (2003). Unsupervised monolingual and bilingual word-sense disambiguation of medical documents using UMLS. In Natural Language Processing in Biomedicine ACL 2003 Workshop (pp. 9-16). East Stroudsburg, PA: Association for Computational Linguistics.
Wren, J.D., & Garner, H.R. (2002). Heuristics for identification of acronym-definition patterns within text: Towards an automated construction of comprehensive acronym-definition dictionaries. Methods of Information in Medicine, 41, 426-434.
Yarowsky, D. (1992). Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. In Proceedings of the 14th International Conference on Computational Linguistics (COLING-92) (pp. 454-460). International Committee on Computational Linguistics. East Stroudsburg, PA: Association for Computational Linguistics.
Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised methods. In 33rd Annual Meeting of the Association for Computational Linguistics, Proceedings of Conference (pp. 189-196). San Francisco: Morgan Kaufmann.
Yu, H., Hripcsak, G., & Friedman, C. (2002). Mapping abbreviations to full forms in electronic articles. Journal of the American Medical Informatics Association, 9, 262-272.