PubMed Related Citations Algorithm
Home
The neighbors of a document are those documents in the
database that are the most similar to it. The similarity between documents
is measured by the words they have in common, with some adjustment for document
lengths. To carry out such a program, one must first define what
a word is. For us, a word is basically an unbroken string of letters and
numerals with at least one letter of the alphabet in it. Words end at hyphens,
spaces, new lines, and punctuation. A list of 310 common, but uninformative,
words (also known as stopwords) are eliminated from processing at this
stage. Next, a limited amount of stemming of words is done, but no thesaurus
is used in processing. Words from the abstract of a document are classified
as text words. Words from titles are also classified as text words, but
words from titles are added in a second time to give them a small advantage
in the local weighting scheme. MeSH terms are placed in a third category,
and a MeSH term with a subheading qualifier is entered twice, once without
the qualifier and once with it. If a MeSH term is starred (indicating a
major concept in a document), the star is ignored. These three categories
of words (or phrases in the case of MeSH) comprise the representation of
a document. No other fields, such as Author or Journal, enter into the calculations.
Having obtained the set of terms that represent
each document, the next step is to recognize that not all words are of
equal value. Each time a word is used, it is assigned a numerical weight.
This numerical weight is based on information that the computer can obtain
by automatic processing. Automatic processing is important because the
number of different terms that have to be assigned weights is close to
two million for this system. The weight or value of a term is dependent
on three types of information: 1) the number of different documents in
the database that contain the term; 2) the number of times the term occurs
in a particular document; and 3) the number of term occurrences in the document.
The first of these pieces of information is used to produce a number called
the global weight of the term. The global weight is used in weighting the
term throughout the database. The second and third pieces of information
pertain only to a particular document and are used to produce a number
called the local weight of the term in that specific document. When a word
occurs in two documents, its weight is computed as the product of the global
weight times the two local weights (one pertaining to each of the documents).
The global weight of a term is greater for the less
frequent terms. This is reasonable because the presence of a term that
occurred in most of the documents would really tell one very little about
a document. On the other hand, a term that occurred in only 100
documents of one million would be very helpful in limiting the set
of documents of interest. A word that occurred in only 10 documents is
likely to be even more informative and will receive an even higher weight.
The local weight of a term is the measure of its
importance in a particular document. Generally, the more frequent a term
is within a document, the more important it is in representing the content
of that document. However, this relationship is saturating, i.e., as
the frequency continues to go up, the importance of the word increases less
rapidly and finally comes to a finite limit. In addition, we do not want a longer
document to be considered more important just because it is longer; therefore,
a length correction is applied. This local weight computation is based
on the Poisson distribution and the formula can be found in:
Kim W, Aronson AR, Wilbur WJ. Automatic MeSH term assignment and quality assessment,
Proceedings of the 2001 AMIA Symposium, pp 319-23.
(226kb)
The similarity between two documents is computed
by adding up the weights (local wt1 × local wt2 × global wt) of all of the
terms the two documents have in common. This provides an indication of
how related two documents are. The resultant score is an example of a vector
score. Vector scoring was originated by Gerard Salton and has a long history
in text retrieval. The interested reader is referred to Salton, Automatic
Text Processing, Reading, MA: Addison-Wesley, 1989 for further information
on this topic. Our approach differs from other approaches in the
way we calculate the local weights for the individual terms. Once the similarity
score of a document in relation to each of the other documents in the database
has been computed, that document's neighbors are identified as the most
similar (highest scoring) documents found. These closely related documents
are pre-computed for each document in PubMed so that when you select
Related Articles, the system has only to retrieve this list. This
enables a fast response time for such queries.