2005 Mar 24;6:75.
doi: 10.1186/1471-2105-6-75.

Ranking the whole MEDLINE database according to a large training set using text indexing


Brian P Suomela et al. BMC Bioinformatics.

Abstract

Background: The MEDLINE database contains over 12 million references to the scientific literature, and about three quarters of recent articles include an abstract of the publication. Retrieving entries with keyword queries is useful for human users who need small selections. However, some analyses of the literature or database developments may require a complete ranking of all references in MEDLINE by their relevance to a topic of interest. This report describes a method that produces such a ranking, using the differences in word content between MEDLINE entries related to a topic and the whole of MEDLINE, within a computational time appropriate for an article search engine.

Results: We tested the ability of our system to retrieve MEDLINE references relevant to the subject of stem cells. We took advantage of the existing annotation of references with terms from the MeSH hierarchical vocabulary (Medical Subject Headings, developed at the National Library of Medicine). A training set of 81,416 references was constructed by selecting entries annotated with the MeSH term stem cells or some child in its subtree. Frequencies of all nouns, verbs, and adjectives in the training set were computed, and the ratios of word frequencies in the training set to those in the whole of MEDLINE were used to score references. Self-consistency of the algorithm, benchmarked with a test set containing the training set and an equal number of references randomly selected from MEDLINE, was better using nouns (79%) than adjectives (73%) or verbs (70%). Evaluation on 6,923 references not used for training, of which 204 articles were relevant to stem cells according to a human expert, indicated a recall of 65% at a precision of 65%.
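The scoring scheme described above can be sketched in a few lines: each reference is scored by the average, over its words, of the ratio between the word's frequency in the training set and its frequency in the background corpus. This is a minimal sketch, assuming simple relative frequencies and pre-tokenized, part-of-speech-filtered word lists; the function names are illustrative, not from the paper's scripts.

```python
from collections import Counter

def word_frequencies(abstracts):
    """Relative frequency of each word across a list of tokenized abstracts."""
    counts = Counter(w for abstract in abstracts for w in abstract)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def score_reference(words, train_freq, background_freq):
    """Score a reference as the mean ratio of training-set frequency to
    background (whole-MEDLINE) frequency over its words.
    Words unseen in the background corpus are skipped."""
    ratios = [train_freq.get(w, 0.0) / background_freq[w]
              for w in words if background_freq.get(w, 0.0) > 0]
    return sum(ratios) / len(ratios) if ratios else 0.0
```

Ranking the whole database then amounts to computing this score for every reference and sorting; references whose vocabulary resembles the training set score well above the background average.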

Conclusion: This strategy appears to be useful for predicting the relevance of MEDLINE references to a given concept. The method is simple and can be used with any user-defined training set. Choice of the part of speech of the words used for classification has important effects on performance. Lists of words, scripts, and additional information are available from the web address http://www.ogic.ca/projects/ks2004/.


Figures

Figure 1
Self-consistency test of the algorithm. Fraction of references from the stem cell training set (F) retrieved when selecting a number (N) of top-scoring references in a mixed set combining the training set and the random set. Nouns are the best discriminators, with F = 0.87 for the top half of the list. F was 0.79 for adjectives, 0.73 for verbs, and 0.70 for nouns plus adjectives. Performance could not be perfect even in theory, because the training set contains articles that are not relevant to stem cells and the random set contains articles that are.
Figure 2
Distribution of scores in MEDLINE sets. For each of the sets of MEDLINE references analyzed in this work we plot the distribution of score values (using the average over all nouns). The complete MEDLINE (black line with X's) has a maximum around 0.65. The training set of 81,416 references annotated with MeSH terms related to stem cells (magenta with diamonds) has a maximum at 2.75 and a "hump" at 1.5. This shape arises because the set includes both references truly related to stem cells and others that are not, whose scores follow the general MEDLINE background distribution. The random set of 81,416 references (red with triangles) has, as expected, a distribution identical to that of the whole of MEDLINE. The 6,923 randomly selected MEDLINE references (green with squares) used for the recall and precision test also follow the background distribution. Of those, the 204 references judged stem cell related by a human expert (blue bars) had significantly higher scores than the background distribution of MEDLINE.
Figure 3
Recall and precision of the algorithm. The recall and the precision of the algorithm were checked in a set of 6,923 references not included in the training set. Manual examination of the set resulted in the identification of 204 references (positives) relevant to stem cells. Recall was measured as TP/(TP+FN) and precision as TP/(TP+FP), where TP is true positives, FP is false positives, and FN is false negatives.
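The recall and precision definitions in the caption translate directly into code. The sketch below computes both from a set of retrieved references and a set of expert-judged relevant ones; the function name and ID representation are illustrative assumptions, not from the paper.

```python
def recall_precision(retrieved, relevant):
    """Recall = TP/(TP+FN) and precision = TP/(TP+FP), where TP is the
    number of retrieved references that are truly relevant, FN the
    relevant references missed, and FP the irrelevant ones retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    recall = tp / len(relevant) if relevant else 0.0
    precision = tp / len(retrieved) if retrieved else 0.0
    return recall, precision
```

Sweeping the score threshold used to decide which references count as "retrieved" traces out the recall-precision curve; the abstract reports the operating point where both reach 65%.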
