2005 Mar 24;6:75.
doi: 10.1186/1471-2105-6-75.

Ranking the whole MEDLINE database according to a large training set using text indexing


Brian P Suomela et al. BMC Bioinformatics.

Abstract

Background: The MEDLINE database contains over 12 million references to the scientific literature, and about three quarters of recent articles include an abstract of the publication. Retrieving entries with keyword queries is useful for human users who need small selections. However, some analyses of the literature or database developments may require a complete ranking of all references in MEDLINE by their relevance to a topic of interest. This report describes a method that produces such a ranking, using the differences in word content between MEDLINE entries related to a topic and the whole of MEDLINE, within a computational time appropriate for an article search engine.

Results: We tested the ability of our system to retrieve MEDLINE references relevant to the subject of stem cells. We took advantage of the existing annotation of references with terms from the MeSH hierarchical vocabulary (Medical Subject Headings, developed at the National Library of Medicine). A training set of 81,416 references was constructed by selecting entries annotated with the MeSH term stem cells or some child in its subtree. Frequencies of all nouns, verbs, and adjectives in the training set were computed, and the ratios of word frequencies in the training set to those in the whole of MEDLINE were used to score references. Self-consistency of the algorithm, benchmarked with a test set containing the training set and an equal number of references randomly selected from MEDLINE, was better using nouns (79%) than adjectives (73%) or verbs (70%). Evaluation on 6,923 references not used for training, of which 204 articles were relevant to stem cells according to a human expert, indicated a recall of 65% at a precision of 65%.
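The scoring scheme described above can be sketched in a few lines: each reference is scored by the average, over its words, of the ratio between the word's frequency in the training set and its frequency in the background corpus. This is a minimal sketch, assuming simple relative frequencies and pre-tokenized, part-of-speech-filtered word lists; the function names are illustrative, not from the paper's scripts.

```python
from collections import Counter

def word_frequencies(abstracts):
    """Relative frequency of each word across a list of tokenized abstracts."""
    counts = Counter(w for abstract in abstracts for w in abstract)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def score_reference(words, train_freq, background_freq):
    """Score a reference as the mean ratio of training-set frequency to
    background (whole-MEDLINE) frequency over its words.
    Words unseen in the background corpus are skipped."""
    ratios = [train_freq.get(w, 0.0) / background_freq[w]
              for w in words if background_freq.get(w, 0.0) > 0]
    return sum(ratios) / len(ratios) if ratios else 0.0
```

Ranking the whole database then amounts to computing this score for every reference and sorting; references whose vocabulary resembles the training set score well above the background average.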

Conclusion: This strategy appears to be useful for predicting the relevance of MEDLINE references to a given concept. The method is simple and can be used with any user-defined training set. Choice of the part of speech of the words used for classification has important effects on performance. Lists of words, scripts, and additional information are available from the web address http://www.ogic.ca/projects/ks2004/.


Figures

Figure 1
Self-consistency test of the algorithm. Fraction of references from the stem cell training set (F) retrieved when selecting a number (N) of top-scoring references in a mixed set combining the training set and the random set. Nouns are the best discriminators, with F = 0.87 for the top half of the list. F was 0.79 for adjectives, 0.73 for verbs, and 0.70 for nouns plus adjectives. Performance could not be perfect even in theory, because the training set contains articles that are not relevant to stem cells and the random set contains articles that are.
Figure 2
Distribution of scores in MEDLINE sets. For each of the sets of MEDLINE references analyzed in this work we plot the distribution of score values (using the average over all nouns). The complete MEDLINE (black line with X's) has a maximum around 0.65. The training set of 81,416 references annotated with MeSH terms related to stem cells (magenta with diamonds) has a maximum at 2.75 and a "hump" at 1.5. This shape arises because the set includes both references truly related to stem cells and others that are not, whose scores follow the general MEDLINE background distribution. The random set of 81,416 references (red with triangles) has, as expected, a distribution identical to that of the whole of MEDLINE. The 6,923 randomly selected MEDLINE references (green with squares) used for the recall and precision test also follow the background distribution. Of those, the 204 references judged stem cell related by a human expert (blue bars) had significantly higher scores than the background distribution of MEDLINE.
Figure 3
Recall and precision of the algorithm. The recall and the precision of the algorithm were checked in a set of 6,923 references not included in the training set. Manual examination of the set resulted in the identification of 204 references (positives) relevant to stem cells. Recall was measured as TP/(TP+FN) and precision as TP/(TP+FP), where TP is true positives, FP is false positives, and FN is false negatives.
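The recall and precision definitions in the caption translate directly into code. The sketch below computes both from a set of retrieved references and a set of expert-judged relevant ones; the function name and ID representation are illustrative assumptions, not from the paper.

```python
def recall_precision(retrieved, relevant):
    """Recall = TP/(TP+FN) and precision = TP/(TP+FP), where TP is the
    number of retrieved references that are truly relevant, FN the
    relevant references missed, and FP the irrelevant ones retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    recall = tp / len(relevant) if relevant else 0.0
    precision = tp / len(retrieved) if retrieved else 0.0
    return recall, precision
```

Sweeping the score threshold used to decide which references count as "retrieved" traces out the recall-precision curve; the abstract reports the operating point where both reach 65%.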
