SimConcept: A Hybrid Approach for Simplifying Composite Named Entities in Biomedical Text

Authors: Chih-Hsuan Wei, Robert Leaman and Zhiyong Lu (PI)

Research highlights

We propose SimConcept, a hybrid approach that integrates a machine learning model with a pattern identification strategy to identify the individual mentions within a composite named entity. Specifically, we first trained a conditional random field (CRF) model to detect composite mentions and then to identify the antecedent (e.g., colorectal) and conjunct regions (e.g., adenomas and carcinoma) of each composite mention. Next, we manually developed four patterns to model the six types of composite mentions in our study. Finally, applying these patterns to the regions identified in a composite mention generates the individual mentions in the final output (e.g., colorectal adenomas and colorectal carcinoma).

Method overview

SimConcept consists of two modules, as shown in Figure 1. The first module is a conditional random field model: the input mention is split into tokens, and each token is assigned a label according to the most likely sequence of states through the model. The second module reassembles the tokens into individual mentions using a pattern identification method.

Figure 1. An overview of the SimConcept workflow.
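
As a rough illustration of the second module, the sketch below expands a composite mention whose tokens have already been labeled. It is a simplified example, not the SimConcept implementation: it covers only the case of a shared antecedent followed by coordinated conjuncts, and the label names ("A", "C", "J") and the function expand_shared_antecedent are assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical label set for this sketch (not the labels used by SimConcept):
#   "A" = antecedent token, "C" = conjunct token,
#   "J" = joining token such as "and" or ","
LabeledToken = Tuple[str, str]


def expand_shared_antecedent(tokens: List[LabeledToken]) -> List[str]:
    """Prepend the shared antecedent to each conjunct of a labeled mention."""
    antecedent = [tok for tok, label in tokens if label == "A"]
    conjuncts: List[List[str]] = []
    current: List[str] = []
    for tok, label in tokens:
        if label == "C":
            current.append(tok)
        elif current:
            # Any non-conjunct token closes the conjunct being collected.
            conjuncts.append(current)
            current = []
    if current:
        conjuncts.append(current)
    return [" ".join(antecedent + conj) for conj in conjuncts]


# Labels as the first module might assign them (written by hand here).
labeled = [
    ("colorectal", "A"),
    ("adenomas", "C"),
    ("and", "J"),
    ("carcinoma", "C"),
]
mentions = expand_shared_antecedent(labeled)
print(mentions)  # ['colorectal adenomas', 'colorectal carcinoma']
```

In this sketch the labels are supplied by hand; in the workflow described above they would come from the conditional random field model, and other composite types would need their own patterns.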

Results

To evaluate our method, we used leave-one-out cross-validation on the three sets (i.e., gene, disease and chemical). Table 1 shows the results: performance is high for all three entity types, with F-measures between 86.05% and 90.42%. Because this study aims to help bioconcept normalization, we also applied SimConcept within GenNorm [21] and DNorm [18] and evaluated them on the test sets of the BioCreative II gene normalization task [12] and the NCBI disease corpus [50], respectively (no normalized chemical corpus is available). To avoid training on the test set, the training set for SimConcept excluded the test corpora for GenNorm and DNorm. As shown in Table 2, using SimConcept improves on the state-of-the-art performance by 1.17% in F-measure (P-value=0.02) for gene normalization and by 1.34% in F-measure (P-value=0.03) for disease normalization.

Bioconcepts	Precision	Recall	F-measure
Gene	89.51%	91.35%	90.42%
Disease	87.92%	85.07%	86.47%
Chemical	87.44%	84.71%	86.05%

Table 1. Evaluation results on the SimConcept corpus.
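
The F-measure values in Table 1 are consistent with the balanced F-measure, i.e., the harmonic mean of precision and recall. This page does not state the formula, so the short check below assumes that definition; the function name f_measure is chosen for this example only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F-measure %) copied from Table 1.
table1 = {
    "Gene": (89.51, 91.35, 90.42),
    "Disease": (87.92, 85.07, 86.47),
    "Chemical": (87.44, 84.71, 86.05),
}

for name, (p, r, reported) in table1.items():
    computed = f_measure(p, r)
    print(f"{name}: computed {computed:.2f}%, reported {reported:.2f}%")
```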

Tools	Precision	Recall	F-measure
GenNorm + SimConcept	87.01%	86.13%	86.57%
GenNorm	86.72%	84.09%	85.38%
DNorm + SimConcept	80.91%	79.23%	80.06%
DNorm	80.69%	76.85%	78.72%

Table 2. Contribution of SimConcept to gene and disease normalization performance.

Downloads

SimConcept Source Code

Please cite

Wei C-H, Leaman R, Lu Z. SimConcept: A Hybrid Approach for Simplifying Composite Named Entities in Biomedicine. Proceedings of the ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, Newport Beach, CA, 2014, pp. 138-146.