For more than 150 years, the US National Library of Medicine (NLM) has provided access to the biomedical literature through the analytical efforts of human indexers. Since 1966, access has been provided in the form of electronically searchable document surrogates consisting of bibliographic citations, descriptors assigned by indexers from the Medical Subject Headings (MeSH®) controlled vocabulary and, since 1974, author abstracts for many citations.
The MEDLINE®/PubMed® database contains over 21 million citations. It currently grows at the rate of about 700,000 citations per year and covers 5,591 international biomedical journals in 58 languages. Human indexing consists of reviewing the full text of each article, rather than an abstract or summary, and assigning descriptors that represent the central concepts as well as every other topic that is discussed to a significant extent. Indexers assign descriptors from the MeSH vocabulary of 26,581 main headings, which are often referred to as MeSH Headings (MHs). Main heading descriptors may be further qualified by selections from a collection of 83 topical Subheadings (SHs). In addition, there are 203,658 Supplementary Concepts (formerly Supplementary Chemicals), which are available for inclusion in MEDLINE records.
Since 1990, there has been a steady and sizeable increase in the number of articles indexed for MEDLINE, because of both an increase in the number of indexed journals and, to a lesser extent, an increase in the number of in-scope articles in journals that are already being indexed. NLM expects to index over one million articles annually within a few years.
In the face of a growing workload and dwindling resources, we have undertaken the NLM Indexing Initiative (II) to explore indexing methodologies that can help ensure that MEDLINE and other NLM document collections maintain their quality and currency and thereby contribute to NLM's mission of maintaining quality access to the biomedical literature.
The objective of NLM's Indexing Initiative is to investigate methods for automatic and assisted indexing to enhance access to NLM document collections including MEDLINE. The project will be considered a success if our methods result in an increase in indexing efficiency while maintaining or improving access to biomedical information.
Human indexing is an expensive, labor-intensive activity. Indexers are highly trained individuals, not only in one or more of the subject domains covered by the MEDLINE database, but also in MEDLINE indexing practice. The average cost of indexing a MEDLINE article is $9.40, and special tasks, such as adding a gene link at an average cost of $4.90, only add to the expense.
Considerations such as the increasing demand on NLM's indexing resources and staff, coupled with the flat budgets seen throughout federal agencies, make clear that if (semi-)automated methods can be successfully developed and implemented, the project will have an important impact on NLM's ability to continue to provide high-quality services to its constituents. Secondarily, but also importantly, the project should continue to contribute to information science research and should offer training opportunities to young researchers in the field. We hope that the research, training and production efforts undertaken by the Indexing Initiative over the years have indeed made such contributions.
Since its inception in 1996, the Indexing Initiative project has investigated language-based and machine learning subject indexing methods, primarily for use by NLM indexers in creating MeSH indexing for MEDLINE. Researchers throughout the Library explored several indexing methodologies, the best of which eventually became a system called the NLM Medical Text Indexer (MTI). MTI indexing recommendations have been available to the indexers since 2002; since then, MTI's usage has grown steadily to the point where indexers request MTI results almost 2,500 times a day, which is about 50% of indexing throughput.
The II project owes its success in no small measure to NLM knowledge resources. Specifically, the project critically relies on the continued existence and growth of NLM's MeSH vocabulary and of the Unified Medical Language System® (UMLS®) Knowledge Sources, especially the Metathesaurus®, which currently contains about 2,612,000 concepts, and the SPECIALIST Lexicon (2012) containing about 449,000 lexical entries.
The II core team gratefully acknowledges the many essential contributions to the Indexing Initiative by many NLM colleagues, especially John Wilbur for the PubMed Related Citations indexing method, Natalie Xie for TexTool (an interface to Related Citations), Olivier Bodenreider for Restrict to MeSH, Sonya Shooshan for the annual MetaMap ambiguity study, Aurélie Névéol for spearheading the Subheading Attachment project, Florence Chang of Specialized Information Services (SIS) for MTI post-processing and the overall organization of what has become the Medical Text Indexer, John Rozier of the Office of Computer and Communications Systems (OCCS) for incorporating MTI features into the DCMS system, Barbara Bushman of Cataloging for her assistance in integrating MTI into NLM's cataloging process, and many other Library Operations colleagues, including Deborah Ozga, Lou Knecht, Rebecca Stanger, Joe Thomas and Preeti Kochar for overall guidance from the Index Section's perspective.
Finally, although the II team is proud of many of its recent accomplishments, it is fair to say that having MTI recommendations treated as first-line indexing for some journals (MTIFL) is the most noteworthy. The series of experiments that enabled MTIFL was spearheaded by Marina P. Rappoport of the Index Section. Despite battling serious illness, Marina maintained her high level of energy and intense interest in the research. MTIFL's existence and success owe a deep debt to Marina, and, in Marina's memory, we are honored to acknowledge that debt here.
The logo has a blue background with a female figure sitting at a computer on the left side of the logo. The words "Indexing Initiative" are roughly centered in the logo. The rest of the picture details the flow of data. The top flow shows "Text" flowing into the "Indexing Process", which produces "Indexed Text", which then enters an "Input Program" that stores the "Indexed Text" in the "MEDLINE" database. On the left side of the data flow, the person sitting at the computer creates a "Query", which is incorporated into a "Search Strategy". The "Search Strategy" provides "Feedback" to the person and also a "Search Statement" to the "Retrieval Program", which uses the search query criteria to access information in the "MEDLINE" database and then provides "Information" back to the person.