
SimConcept: A Hybrid Approach for Simplifying Composite Named Entities in Biomedical Text

Authors: Chih-Hsuan Wei, Robert Leaman and Zhiyong Lu (PI)

Research highlights

Here we propose SimConcept, a hybrid approach that integrates a machine learning model with a pattern identification strategy to extract the individual mentions from a composite named entity. More specifically, we first trained a Conditional Random Fields (CRF) model to detect composite mentions and then identify the antecedent (e.g., colorectal) and conjunct regions (e.g., adenomas and carcinoma) of each composite mention. Next, we manually developed four patterns to model the six types of composite mentions in our study. Finally, applying these patterns to the previously identified regions generates the individual mentions in our final output (e.g., colorectal adenomas and colorectal carcinoma).

Method overview

SimConcept consists of two modules, as shown in Figure 1. The first module is a conditional random field model: the input mention is split into tokens, and each token is assigned a label according to the most likely sequence of states through the model. The second module reassembles the tokens into individual mentions using a pattern identification method.
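The reassembly step can be illustrated with a minimal sketch. It assumes the first module has already labeled each token; the label names (`ANTECEDENT`, `CONJUNCT`, `COORDINATOR`) and the function `expand_composite` are hypothetical, and the single rule shown (prefix every conjunct with the shared antecedent) covers only the simplest case, not the four patterns used by SimConcept.

```python
from typing import List, Tuple

# Hypothetical label names: the page does not list the actual CRF label set.
ANTECEDENT = "ANTECEDENT"
CONJUNCT = "CONJUNCT"
COORDINATOR = "COORDINATOR"


def expand_composite(labeled_tokens: List[Tuple[str, str]]) -> List[str]:
    """Prefix each conjunct region with the shared antecedent region."""
    antecedent: List[str] = []
    conjuncts: List[List[str]] = []
    current: List[str] = []

    for token, label in labeled_tokens:
        if label == CONJUNCT:
            # Consecutive conjunct tokens belong to the same region.
            current.append(token)
            continue
        # Any other label closes the conjunct region being collected.
        if current:
            conjuncts.append(current)
            current = []
        if label == ANTECEDENT:
            antecedent.append(token)

    if current:
        conjuncts.append(current)

    return [" ".join(antecedent + region) for region in conjuncts]


if __name__ == "__main__":
    labeled = [
        ("colorectal", ANTECEDENT),
        ("adenomas", CONJUNCT),
        ("and", COORDINATOR),
        ("carcinoma", CONJUNCT),
    ]
    for mention in expand_composite(labeled):
        print(mention)
```

With the example from the highlights above, the labeled tokens expand to "colorectal adenomas" and "colorectal carcinoma". Other composite types (for example, ranges or shared suffixes) would need their own rules.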

Figure 1. An overview of the SimConcept workflow.

Results

To evaluate our method, we used leave-one-out cross validation on the three sets (i.e., gene, disease and chemical). Table 1 shows the results of our evaluation: the F-measure exceeds 86% for all three entity types. Because this study aims to help bioconcept normalization, we also applied SimConcept in GenNorm [21] and DNorm [18], and evaluated on the test sets of the BioCreative II gene normalization task [12] and the NCBI disease corpus [50], respectively (no normalized chemical corpus is available). To avoid training on the test set, the training set for SimConcept excluded the test corpora for GenNorm and DNorm. As shown in Table 2, using SimConcept further improves the state-of-the-art performance by 1.17% in F-measure (P-value=0.02) for gene normalization and by 1.34% in F-measure (P-value=0.03) for disease normalization.

Bioconcepts             Precision    Recall    F-measure
Gene                    89.51%       91.35%    90.42%
Disease                 87.92%       85.07%    86.47%
Chemical                87.44%       84.71%    86.05%
Table 1. Evaluation of SimConcept on the SimConcept corpus.

Tools                   Precision    Recall    F-measure
GenNorm + SimConcept    87.01%       86.13%    86.57%
GenNorm                 86.72%       84.09%    85.38%
DNorm + SimConcept      80.91%       79.23%    80.06%
DNorm                   80.69%       76.85%    78.72%
Table 2. Contribution of SimConcept to gene and disease normalization performance.
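
As a quick consistency check on the tables above, the short sketch below recomputes each F-measure from its precision and recall. It assumes the reported F-measure is the balanced F1 score (the harmonic mean of precision and recall), which the page does not state explicitly; the helper name `f_measure` is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F-measure %) copied from Tables 1 and 2.
REPORTED = {
    "Gene": (89.51, 91.35, 90.42),
    "Disease": (87.92, 85.07, 86.47),
    "Chemical": (87.44, 84.71, 86.05),
    "GenNorm + SimConcept": (87.01, 86.13, 86.57),
    "GenNorm": (86.72, 84.09, 85.38),
    "DNorm + SimConcept": (80.91, 79.23, 80.06),
    "DNorm": (80.69, 76.85, 78.72),
}

if __name__ == "__main__":
    for name, (p, r, reported) in REPORTED.items():
        print(f"{name}: computed {f_measure(p, r):.2f}%, reported {reported:.2f}%")
```

Under that assumption, every recomputed value agrees with the reported F-measure to two decimal places.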

Please cite