tmChem: a high performance approach for chemical named entity recognition and normalization
Authors: Robert Leaman, Chih-Hsuan Wei and Zhiyong Lu (PI)
Research highlights
tmChem is an open-source software tool for identifying chemical names in biomedical literature, including chemical identifiers, drug brand and trade names, and systematic formats. tmChem uses conditional random fields with a rich feature set, together with rule-based post-processing modules that resolve local abbreviations and improve consistency. tmChem achieved the highest performance of any submission to the BioCreative IV CHEMDNER task (an F-measure above 87%).
Method overview
The tmChem system combines two linear-chain conditional random field (CRF) models that employ different tokenizations and feature sets. Model 1 is an adaptation of the BANNER named entity recognizer. It uses the MALLET toolkit and is implemented in Java. Model 2 is repurposed from part of the tmVar system for locating genetic variants. It uses the CRF++ toolkit and is implemented in Perl and C++. Both models employ multiple post-processing steps.
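To give a feel for what a consistency-oriented post-processing step can look like, here is a minimal sketch in Python. It is not tmChem's actual rule set, which this page does not describe. It assumes one plausible reading of "improving consistency": if a string was tagged as a chemical somewhere in a document, untagged exact repeats of that string are tagged too. The function name `propagate_mentions` and the `(start, end)` span format are hypothetical.

```python
import re

def propagate_mentions(text, spans):
    """Return spans extended with untagged repeats of already-tagged strings.

    text  -- the document text
    spans -- (start, end) character offsets from a tagger (hypothetical format)
    """
    tagged = set(spans)
    # Surface strings the tagger marked at least once in this document
    surfaces = {text[start:end] for start, end in spans}
    for surface in surfaces:
        # Require non-word characters on both sides so that a short name
        # does not match inside a longer word
        pattern = r"(?<!\w)" + re.escape(surface) + r"(?!\w)"
        for match in re.finditer(pattern, text):
            tagged.add((match.start(), match.end()))
    return sorted(tagged)

text = "Aspirin inhibits COX-1. Patients received aspirin or Aspirin daily."
spans = [(0, 7)]  # assumed tagger output: only the first "Aspirin" was found
print(propagate_mentions(text, spans))
```

In this toy input the second capitalized "Aspirin" is added, while the lowercase "aspirin" is not, because the match is exact and case-sensitive. A real system would have to decide how to handle case, overlapping spans, and ambiguous strings.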
Results
tmChem was evaluated on the CHEMDNER test set for the CEM task (named entity recognition), using each model on its own and several strategies for combining the output of the two models.
Method | Precision | Recall | F-measure |
Model 1 | 0.8595 | 0.8721 | 0.8657 |
Model 2 | 0.8909 | 0.8575 | 0.8739 |
Naive combination | 0.8192 | 0.9209 | 0.8671 |
Heuristic combination | 0.8516 | 0.8906 | 0.8706 |
High recall combination | 0.7672 | 0.9212 | 0.8372 |
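The F-measure column can be reproduced from the precision and recall columns. The short Python check below assumes the reported F-measure is the balanced harmonic mean of precision and recall (F1); the values in the table agree with that formula to within rounding. The function name `f_measure` is chosen here for illustration.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F-measure), copied from the table above
results = {
    "Model 1": (0.8595, 0.8721, 0.8657),
    "Model 2": (0.8909, 0.8575, 0.8739),
    "Naive combination": (0.8192, 0.9209, 0.8671),
    "Heuristic combination": (0.8516, 0.8906, 0.8706),
    "High recall combination": (0.7672, 0.9212, 0.8372),
}

for name, (precision, recall, reported) in results.items():
    computed = f_measure(precision, recall)
    # Allow a small tolerance because the table values are rounded
    assert abs(computed - reported) < 1e-4, name
    print(f"{name}: computed {computed:.4f}, reported {reported:.4f}")
```

Read this way, the table shows the trade-off behind the combination strategies: each combination raises recall above that of either single model and lowers precision, and Model 2 alone has the highest F-measure.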
Downloads
Model 1 Source Code (Java)
Model 2 Source Code (Perl/C++)
tmChem-tagged PubMed results in PubTator
tmChem RESTful API
Please cite
- Leaman R, Wei C-H, Lu Z. tmChem: a high performance tool for chemical named entity recognition and normalization. Journal of Cheminformatics, 7(Suppl 1):S3 (2015)