Ranking the whole MEDLINE database according to a large training set using text indexing
- PMID: 15790421
- PMCID: PMC1274266
- DOI: 10.1186/1471-2105-6-75
Abstract
Background: The MEDLINE database contains over 12 million references to scientific literature, and about three quarters of recent entries include an abstract of the publication. Keyword queries serve human users who need to retrieve small selections of entries. However, some literature analyses and database development tasks require a complete ranking of all MEDLINE references by their relevance to a topic of interest. This report describes a method that produces such a ranking from the differences in word content between MEDLINE entries related to a topic and the whole of MEDLINE, in a computational time appropriate for an article search query engine.
Results: We tested the ability of our system to retrieve MEDLINE references relevant to the subject of stem cells. We took advantage of the existing annotation of references with terms from the MeSH hierarchical vocabulary (Medical Subject Headings, developed at the National Library of Medicine). A training set of 81,416 references was constructed by selecting entries annotated with the MeSH term "stem cells" or any child term in its subtree. Frequencies of all nouns, verbs, and adjectives in the training set were computed, and the ratios of word frequencies in the training set to those in the entire MEDLINE were used to score references. Self-consistency of the algorithm, benchmarked with a test set containing the training set and an equal number of references randomly selected from MEDLINE, was better using nouns (79%) than adjectives (73%) or verbs (70%). Evaluation of the system on 6,923 references not used for training, of which 204 were relevant to stem cells according to a human expert, gave a recall of 65% at a precision of 65%.
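The frequency-ratio scoring described in the Results can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how per-word ratios are combined into a reference score, so the sketch assumes the mean log-ratio over the words of a reference; it also uses a plain regular-expression tokenizer instead of the part-of-speech filtering (nouns, verbs, adjectives) used in the paper. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; the paper filters by part of speech instead."""
    return re.findall(r"[a-z]+", text.lower())


def doc_frequencies(docs):
    """Fraction of documents in which each word occurs at least once."""
    counts = Counter()
    for doc in docs:
        counts.update(set(tokenize(doc)))
    total = len(docs)
    return {word: n / total for word, n in counts.items()}


def word_weights(training_docs, background_docs):
    """Log of (frequency in training set) / (frequency in the whole collection).

    Assumption: only words seen in the training set receive a weight.
    """
    train = doc_frequencies(training_docs)
    back = doc_frequencies(background_docs)
    weights = {}
    for word, f_train in train.items():
        f_back = back.get(word)
        if f_back:  # skip words missing from the background collection
            weights[word] = math.log(f_train / f_back)
    return weights


def score(doc, weights):
    """Assumed combination rule: mean weight of the known words in the document."""
    known = [weights[w] for w in set(tokenize(doc)) if w in weights]
    return sum(known) / len(known) if known else 0.0


def rank(docs, weights):
    """Order documents from most to least topic-like."""
    return sorted(docs, key=lambda d: score(d, weights), reverse=True)


# Toy data standing in for MEDLINE (background) and the MeSH-selected training set.
background = [
    "stem cells differentiate into neurons",
    "hematopoietic stem cells renew",
    "protein kinase activity in yeast",
    "clinical trial of aspirin in patients",
]
training = background[:2]
weights = word_weights(training, background)
ranked = rank(["aspirin trial", "stem cells in culture"], weights)
```

Because the weights depend only on two frequency tables, scoring a reference is a lookup per word, which is consistent with the goal of ranking the whole database in query-engine time.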
Conclusion: This strategy appears to be useful for predicting the relevance of MEDLINE references to a given concept. The method is simple and can be used with any user-defined training set. The choice of the part of speech of the words used for classification has an important effect on performance. Lists of words, scripts, and additional information are available at http://www.ogic.ca/projects/ks2004/.
Similar articles
- A protocol for the update of references to scientific literature in biological databases. Appl Bioinformatics. 2003;2(3):189-91. PMID: 15130808
- Text similarity: an alternative way to search MEDLINE. Bioinformatics. 2006 Sep 15;22(18):2298-304. doi: 10.1093/bioinformatics/btl388. Epub 2006 Aug 22. PMID: 16926219
- Using argumentation to retrieve articles with similar citations: an inquiry into improving related articles search in the MEDLINE digital library. Int J Med Inform. 2006 Jun;75(6):488-95. doi: 10.1016/j.ijmedinf.2005.06.007. Epub 2005 Sep 13. PMID: 16165395
- [Searching for evidence-based data]. J Chir (Paris). 2009 Aug;146(4):355-67. doi: 10.1016/j.jchir.2009.08.025. Epub 2009 Sep 22. PMID: 19775689 Review. French.
- MEDLINE and MeSH: challenges for end users. Med Ref Serv Q. 1992 Fall;11(3):29-46. doi: 10.1300/J115V11N03_03. PMID: 10122123 Review.
Cited by
- Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol. 2005 May;3(5):e134. doi: 10.1371/journal.pbio.0030134. Epub 2005 Apr 5. PMID: 15799710 Free PMC article.
- Protein-Protein Interaction Article Classification Using a Convolutional Recurrent Neural Network with Pre-trained Word Embeddings. J Integr Bioinform. 2017 Dec 13;14(4):20170055. doi: 10.1515/jib-2017-0055. PMID: 29236678 Free PMC article.
- Using cited references to improve the retrieval of related biomedical documents. BMC Bioinformatics. 2013 Mar 27;14:113. doi: 10.1186/1471-2105-14-113. PMID: 23537461 Free PMC article.
- Recent developments in StemBase: a tool to study gene expression in human and murine stem cells. BMC Res Notes. 2009 Mar 10;2:39. doi: 10.1186/1756-0500-2-39. PMID: 19284540 Free PMC article.
- MedlineRanker: flexible ranking of biomedical literature. Nucleic Acids Res. 2009 Jul;37(Web Server issue):W141-6. doi: 10.1093/nar/gkp353. Epub 2009 May 8. PMID: 19429696 Free PMC article.