Enhancing text categorization with semantic-enriched representation and training data augmentation

Authors

Xinghua Lu¹, Bin Zheng, Atulya Velivelli, Chengxiang Zhai

Affiliation

¹ Department of Biostatistics, Bioinformatics and Epidemiology, Charleston, SC 29425, USA. lux@musc.edu

J Am Med Inform Assoc. 2006 Sep-Oct;13(5):526-35. doi: 10.1197/jamia.M2051. Epub 2006 Jun 23.

PMID: 16799127
PMCID: PMC1561790
DOI: 10.1197/jamia.M2051

Abstract

Objective: Acquiring and representing biomedical knowledge is an increasingly important component of contemporary bioinformatics. A critical step of the process is to efficiently identify and retrieve relevant documents from the vast volume of modern biomedical literature. In the real world, many information retrieval tasks are difficult because of high data dimensionality and the lack of annotated examples to train a retrieval algorithm. Under such a scenario, the performance of information retrieval algorithms is often unsatisfactory, so improvements are needed.

Design: We studied two approaches that enhance text categorization performance on sparse, high-dimensional data: (1) semantic-preserving dimension reduction by representing text with semantic-enriched features; and (2) augmenting training data with semi-supervised learning. A probabilistic topic model was applied to extract major semantic topics from a corpus of text of interest. The representation of documents was projected from the high-dimensional vocabulary space onto a semantic topic space with reduced dimensionality. A semi-supervised learning algorithm based on graph theory was applied to identify potential positive training cases, which were further used to augment training data. The effects of data transformation and augmentation on text categorization by a support vector machine (SVM) were evaluated.
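The projection from vocabulary space onto topic space can be illustrated with a minimal sketch. It assumes the topic-word distributions are already known (the two topics and their probabilities below are hypothetical) and estimates a document's topic proportions with a simple EM fold-in over fixed topics. This is a simplification for illustration, not the LDA inference procedure used in the study, and the function name `project_to_topics` is invented here.

```python
from collections import Counter

# Hypothetical topic-word distributions p(word | topic). In practice these
# would be learned from the corpus by a probabilistic topic model.
TOPICS = {
    "expression": {"gene": 0.4, "expression": 0.4, "mouse": 0.1, "tumor": 0.1},
    "cancer": {"gene": 0.1, "expression": 0.1, "mouse": 0.2, "tumor": 0.6},
}


def project_to_topics(tokens, topics, n_iter=50):
    """Estimate topic proportions for one document by EM with fixed topics."""
    # Keep only words that at least one topic knows about.
    counts = Counter(t for t in tokens if any(t in phi for phi in topics.values()))
    names = list(topics)
    theta = {k: 1.0 / len(names) for k in names}  # start from a uniform mixture
    total = sum(counts.values())
    if total == 0:
        return theta
    for _ in range(n_iter):
        expected = {k: 0.0 for k in names}
        for word, n in counts.items():
            # E-step: how responsible is each topic for this word?
            weights = {k: theta[k] * topics[k].get(word, 0.0) for k in names}
            z = sum(weights.values())
            for k in names:
                expected[k] += n * weights[k] / z
        # M-step: turn expected word counts into topic proportions.
        theta = {k: expected[k] / total for k in names}
    return theta


doc = "tumor tumor mouse gene protein".split()
print(project_to_topics(doc, TOPICS))
```

The document is reduced from one count per vocabulary word to one proportion per topic, which is the kind of lower-dimensional vector a classifier such as an SVM could then be trained on.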

Results and conclusion: Semantic-enriched data transformation and training data augmented with pseudo-positive cases enhance the efficiency and performance of text categorization by SVM.
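The idea of identifying pseudo-positive cases on a graph can be sketched as follows. The abstract does not name the graph-based semi-supervised algorithm, so this example substitutes a generic label-propagation scheme over a cosine-similarity graph. The function names, the damping factor `alpha`, the threshold, and the toy topic-space vectors are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def propagate_scores(vectors, positive_idx, n_iter=50, alpha=0.8):
    """Spread positive-label mass over a document similarity graph."""
    n = len(vectors)
    # Row-normalized similarity matrix without self-loops.
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    for row in sim:
        s = sum(row)
        if s:
            row[:] = [x / s for x in row]
    seed = [1.0 if i in positive_idx else 0.0 for i in range(n)]
    score = seed[:]
    for _ in range(n_iter):
        # Mix the neighbours' scores with the original seed labels.
        score = [alpha * sum(sim[i][j] * score[j] for j in range(n))
                 + (1 - alpha) * seed[i] for i in range(n)]
    return score


def pseudo_positives(vectors, positive_idx, threshold):
    """Return unlabeled documents whose propagated score reaches the threshold."""
    scores = propagate_scores(vectors, positive_idx)
    return [i for i, s in enumerate(scores)
            if i not in positive_idx and s >= threshold]


# Toy documents in a hypothetical 3-topic space; document 0 is a known positive.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
print(pseudo_positives(docs, {0}, threshold=0.2))
```

Documents returned by `pseudo_positives` would be added to the positive training set before the classifier is retrained. The threshold controls how aggressively unlabeled documents are promoted, and a real application would need to tune it.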

Figures

**Figure 1**
**Representing concepts with word distributions.** Two hypothetical topics are depicted. The bar lengths indicate the word-usage preference, or the conditional probability of observing a word for a given topic.

**Figure 2**
**The directed acyclic graphical representation of the LDA model.** Each node represents a variable, and a shaded node indicates an observed variable. Each rectangular plate represents a replica of the data structure. The variables *D* and *N_d* at the bottom right of each plate indicate the number of replicates of the structure.

**Figure 3**
**Baseline performance of SVM.** Different cost-factor settings were used for training the SVM. Panel A: recall; Panel B: precision; Panel C: F scores; and Panel D: normalized utility.

**Figure 4**
**Semantic representation and classification.** Text documents are represented in vocabulary (VocRep) and semantic space (SemRep). Panels A, B, C and D correspond to the recall, precision, F value and utility score comparisons of VocRep and SemRep.

**Figure 5**
**ROC curve analysis for the TREC text categorization.** The panels correspond to the A, E, G, and T subtasks, respectively. Within each panel, the ROC curve for VocRep is shown as a dashed line with open boxes, while that for SemRep is shown as a solid line with open circles. Each symbol represents the sensitivity and false positive rate (1 - specificity) of the trained SVM^light classifier with a given cost-factor (-j) value.

**Figure 6**
**Effect of data augmentation.** Panels A, B, C and D correspond to recall, precision, utility score and F value.

Cited by

A Kernel Theory of Modern Data Augmentation.
Dao T, Gu A, Ratner AJ, Smith V, De Sa C, Ré C. Proc Mach Learn Res. 2019 Jun;97:1528-1537. PMID: 31777848 Free PMC article.
Developing Embedded Taxonomy and Mining Patients' Interests From Web-Based Physician Reviews: Mixed-Methods Approach.
Li J, Liu M, Li X, Liu X, Liu J. J Med Internet Res. 2018 Aug 16;20(8):e254. doi: 10.2196/jmir.8868. PMID: 30115610 Free PMC article. Review.
Learning to Compose Domain-Specific Transformations for Data Augmentation.
Ratner AJ, Ehrenberg HR, Hussain Z, Dunnmon J, Ré C. Adv Neural Inf Process Syst. 2017 Dec;30:3239-3249. PMID: 29375240 Free PMC article.
Mapping annotations with textual evidence using an scLDA model.
Jin B, Chen V, Chen L, Lu X. AMIA Annu Symp Proc. 2011;2011:834-42. Epub 2011 Oct 22. PMID: 22195141 Free PMC article.
Artificial Intelligence in Medicine: Chances and Challenges for Wide Clinical Adoption.
Varghese J. Visc Med. 2020 Dec;36(6):443-449. doi: 10.1159/000511930. Epub 2020 Oct 12. PMID: 33442551 Free PMC article. Review.
