SimConcept: A Hybrid Approach for Simplifying Composite Named Entities in Biomedical Text
Authors: Chih-Hsuan Wei, Robert Leaman and Zhiyong Lu (PI)
Research highlights
We propose SimConcept, a hybrid approach that integrates a machine learning model with a pattern identification strategy to extract the individual mentions from a composite named entity. More specifically, we first trained a Conditional Random Fields (CRF) model to detect composite mentions and then identify the antecedent region (e.g., colorectal) and the conjunct regions (e.g., adenomas and carcinoma) of each composite mention. Next, we manually developed four patterns to model the six types of composite mentions in our study. Finally, applying these patterns to the regions identified in a composite mention generates the individual mentions in our final output (e.g., colorectal adenomas and colorectal carcinoma).
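The final step can be illustrated with a minimal sketch. It assumes the antecedent and conjunct regions have already been identified, and it covers only the simplest case of one shared antecedent combined with each conjunct. The function name `expand_composite` and the `antecedent_first` flag are hypothetical and are not part of SimConcept; the four actual patterns are not reproduced here.

```python
def expand_composite(antecedent, conjuncts, antecedent_first=True):
    """Combine one shared antecedent with each conjunct (illustrative only)."""
    mentions = []
    for conjunct in conjuncts:
        # Word order depends on whether the shared part precedes the conjuncts.
        parts = (antecedent, conjunct) if antecedent_first else (conjunct, antecedent)
        mentions.append(" ".join(parts))
    return mentions


# Example from the text: "colorectal adenomas and carcinoma"
print(expand_composite("colorectal", ["adenomas", "carcinoma"]))
# → ['colorectal adenomas', 'colorectal carcinoma']
```

Setting `antecedent_first=False` handles an assumed case where the shared part follows the conjuncts, for instance a shared head noun after two modifiers.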
Method overview
SimConcept consists of two modules, as shown in Figure 1. The first module is a conditional random field model: the input mention is split into tokens, and each token is assigned a label according to the most likely sequence of states through the model. The second module reassembles the tokens into individual mentions using a pattern identification method.
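Finding the most likely sequence of states in a linear-chain model is commonly done with Viterbi decoding. The sketch below is a toy version under stated assumptions: the label set (`A` for antecedent, `C` for conjunct, `O` for other) and all scores are invented for illustration, whereas a trained CRF would derive them from learned feature weights. It is not the SimConcept implementation.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions: one dict per token, mapping label -> score
    transitions: dict mapping (previous_label, label) -> score
    """
    # best[i][y] is the best score of any path ending in label y at token i.
    best = [{y: emissions[0][y] for y in labels}]
    back = []
    for i in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[(p, y)])
            scores[y] = best[-1][prev] + transitions[(prev, y)] + emissions[i][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["A", "C", "O"]  # hypothetical label set
tokens = ["colorectal", "adenomas", "and", "carcinoma"]
emissions = [  # invented per-token scores
    {"A": 2.0, "C": 0.5, "O": 0.0},
    {"A": 0.2, "C": 2.0, "O": 0.1},
    {"A": 0.0, "C": 0.1, "O": 2.0},
    {"A": 1.0, "C": 0.9, "O": 0.2},
]
transitions = {(p, y): 0.0 for p in labels for y in labels}
transitions[("O", "A")] = -1.0  # invented penalty: antecedent right after "and"

decoded = viterbi(emissions, transitions, labels)
print(list(zip(tokens, decoded)))
```

In this toy setup the last token scores slightly higher as `A` on its own, but the transition penalty makes `C` the better choice in context, which is the reason for decoding the whole sequence rather than labeling each token independently.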

Results
To evaluate our method, we used leave-one-out cross-validation on the three sets (i.e., gene, disease and chemical). Table 1 shows the results of our evaluation: the overall performance is high for all three entity types. As mentioned in the introduction, this study aims to help bioconcept normalization. We therefore applied SimConcept in GenNorm [21] and DNorm [18], and evaluated them on the test sets of the BioCreative II gene normalization task [12] and the NCBI disease corpus [50], respectively (no normalized chemical corpus is available). To avoid training on the test set, the training set for SimConcept excluded the test corpora for GenNorm and DNorm. As shown in Table 4 and Table 5, using SimConcept further improves the state-of-the-art performance by 1.17% in F-measure (P-value=0.02) for gene normalization and by 1.34% in F-measure (P-value=0.03) for disease normalization.
Bioconcepts | Precision | Recall | F-measure |
Gene | 89.51% | 91.35% | 90.42% |
Disease | 87.92% | 85.07% | 86.47% |
Chemical | 87.44% | 84.71% | 86.05% |
Tools | Precision | Recall | F-measure |
GenNorm + SimConcept | 87.01% | 86.13% | 86.57% |
GenNorm | 86.72% | 84.09% | 85.38% |
DNorm + SimConcept | 80.91% | 79.23% | 80.06% |
DNorm | 80.69% | 76.85% | 78.72% |
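The F-measure columns above are consistent with the standard balanced F-measure, the harmonic mean of precision and recall. The short check below assumes that definition (the page does not state the formula explicitly) and recomputes each reported value from its precision and recall.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure), all in percent, copied from the tables
reported = {
    "Gene": (89.51, 91.35, 90.42),
    "Disease": (87.92, 85.07, 86.47),
    "Chemical": (87.44, 84.71, 86.05),
    "GenNorm + SimConcept": (87.01, 86.13, 86.57),
    "GenNorm": (86.72, 84.09, 85.38),
    "DNorm + SimConcept": (80.91, 79.23, 80.06),
    "DNorm": (80.69, 76.85, 78.72),
}

for name, (p, r, f) in reported.items():
    computed = round(f_measure(p, r), 2)
    print(f"{name}: computed {computed:.2f}%, reported {f:.2f}%")
```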
Downloads
Please cite
- Wei C-H, Leaman R, Lu Z. SimConcept: A Hybrid Approach for Simplifying Composite Named Entities in Biomedicine. Proceedings of the ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, Newport Beach, CA, 2014, pp. 138-146.