tmChem: a high performance approach for chemical named entity recognition and normalization

Authors: Robert Leaman, Chih-Hsuan Wei and Zhiyong Lu (PI)

Research highlights

tmChem is an open-source software tool for identifying chemical names in biomedical literature, including chemical identifiers, drug brand and trade names, and systematic formats. It uses conditional random fields with a rich feature set, together with rule-based post-processing modules that resolve local abbreviations and improve consistency. tmChem achieved the highest performance of any submission to the BioCreative IV CHEMDNER task (over 87% F-measure).

Method overview

The tmChem system combines two linear-chain conditional random field (CRF) models that employ different tokenizations and feature sets. Model 1 is an adaptation of the BANNER named entity recognizer. It uses the MALLET toolkit and is implemented in Java. Model 2 is repurposed from part of the tmVar system for locating genetic variants. It uses the CRF++ toolkit and is implemented in Perl and C++. Both models employ multiple post-processing steps.
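
This page does not describe how the outputs of the two models are merged. Purely as an illustration, the sketch below treats each model's output as a set of character-offset spans and shows two generic merging strategies, union and intersection. The function names, the span representation and the example offsets are all hypothetical and are not taken from the tmChem source code.

```python
from typing import Set, Tuple

# A mention is represented as (start_offset, end_offset); this
# representation is an assumption made for the illustration.
Span = Tuple[int, int]


def union_spans(a: Set[Span], b: Set[Span]) -> Set[Span]:
    """Keep every mention found by either model (tends to favor recall)."""
    return a | b


def intersect_spans(a: Set[Span], b: Set[Span]) -> Set[Span]:
    """Keep only mentions found by both models (tends to favor precision)."""
    return a & b


# Hypothetical outputs of two taggers on the same sentence.
model1_spans = {(0, 7), (12, 20)}
model2_spans = {(0, 7), (25, 31)}

print(sorted(union_spans(model1_spans, model2_spans)))
print(sorted(intersect_spans(model1_spans, model2_spans)))
```

A union keeps mentions that only one model found, which generally raises recall at some cost in precision; an intersection does the opposite. Whether any of the tmChem combination strategies corresponds to either operation is not stated here.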

Results

tmChem was evaluated on the CHEMDNER test set for the CEM task (named entity recognition), both for each model individually and for several strategies that combine the output of the two models.

Method	Precision	Recall	F-measure
Model 1	0.8595	0.8721	0.8657
Model 2	0.8909	0.8575	0.8739
Naive combination	0.8192	0.9209	0.8671
Heuristic combination	0.8516	0.8906	0.8706
High recall combination	0.7672	0.9212	0.8372

Table 1. Evaluation of tmChem on the CHEMDNER test set using micro-averaged precision, recall and F-measure.
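
As a quick consistency check, the F-measure column can be recomputed from the precision and recall columns, assuming the balanced F-measure (the harmonic mean of precision and recall). The sketch below uses only the values from Table 1; the function and variable names are illustrative. Because the published precision and recall are rounded to four decimals, the recomputed values can differ from the reported ones in the last digit.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure), copied from Table 1.
TABLE_1 = {
    "Model 1": (0.8595, 0.8721, 0.8657),
    "Model 2": (0.8909, 0.8575, 0.8739),
    "Naive combination": (0.8192, 0.9209, 0.8671),
    "Heuristic combination": (0.8516, 0.8906, 0.8706),
    "High recall combination": (0.7672, 0.9212, 0.8372),
}

for method, (p, r, reported) in TABLE_1.items():
    computed = f_measure(p, r)
    print(f"{method}: computed {computed:.4f}, reported {reported:.4f}")
```

The table also shows the usual trade-off: the combinations raise recall above either single model but lower precision, so Model 2 alone has the highest F-measure.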

Downloads

Model 1 Source Code (Java)
Model 2 Source Code (Perl/C++)
tmChem-tagged PubMed results in PubTator
tmChem RESTful API

Please cite

Leaman R, Wei C-H, Lu Z. tmChem: a high performance approach for chemical named entity recognition and normalization. Journal of Cheminformatics, 7(Suppl 1):S3 (2015)