Named Entity Recognition in Biomedical and Other Domains

Named Entity Recognition (NER) is the task of identifying and classifying words or sequences of words that denote a concept or entity in a piece of text. For example, "Colorado Springs" is a location, "Barack Obama" is the name of a person, and "University of Colorado at Colorado Springs" is the name of an institution or organization. Every domain has its own specialized names. In the biomedical domain, for instance, "cell cycle-dependent transcription factor" is the name of a protein and "peripheral blood monocytes" is the name of a cell type.
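Because an entity can span several words, it helps to see how entity spans can be turned into per-word labels. The sketch below uses the widely used BIO scheme (B = beginning of an entity, I = inside, O = outside). The scheme is an assumption here and is not described in the text above. The function name bio_tags and the labels PER and LOC are hypothetical, chosen only for illustration.

```python
def bio_tags(tokens, entities):
    """Convert entity spans to one BIO tag per token.

    entities: list of (start, end, label) tuples over token indices,
    with `end` exclusive. Spans are assumed not to overlap.
    """
    tags = ["O"] * len(tokens)           # default: token is outside any entity
    for start, end, label in entities:
        tags[start] = "B-" + label       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label       # remaining tokens of the entity
    return tags


tokens = "Barack Obama visited Colorado Springs".split()
entities = [(0, 2, "PER"), (3, 5, "LOC")]  # hypothetical annotations
for token, tag in zip(tokens, bio_tags(tokens, entities)):
    print(token, tag)
```

With this encoding, recognizing entities reduces to assigning one tag to each word, which is the form in which a classifier can be applied.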

Prior Work

Various methods can be used to train a program to identify named entities of different kinds. NER is a classification problem: each word or word sequence must be assigned an entity class. Conditional Random Fields are a popular technique, as are Support Vector Machines (SVMs). SVMs generalize well and can handle high-dimensional data. They do not suffer from local minima, have few parameters to tune, and produce reproducible results. However, SVMs train slowly on large inputs. They are also primarily binary classifiers, so multi-class problems are often solved by combining several binary machines.

Habib and Kalita (2007) and Habib (2008; 2008 dissertation) present a new multi-class SVM implementation for solving the NER problem in the biomedical domain. The approach does not rely on language-specific or domain-specific knowledge and achieves good out-of-the-box accuracy. It reduces the training time of multi-class SVMs by orders of magnitude.
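To make the idea of combining several binary machines concrete, here is a minimal one-vs-rest sketch: one binary linear SVM is trained per class, and a word is assigned the class whose machine gives the highest score. This is a generic illustration, not the implementation from the cited work. The binary SVM is trained by plain sub-gradient descent on the regularized hinge loss, and the suffix feature, the toy words, the class names, and the hyperparameter values are all hypothetical.

```python
def features(token):
    # Hypothetical, deliberately tiny feature set: the last three characters.
    return ["suffix=" + token[-3:].lower()]


def train_binary_svm(examples, labels, lam=0.01, lr=0.1, epochs=200):
    """Linear SVM via full-batch sub-gradient descent on the hinge loss.

    examples: list of feature-name lists; labels: +1 or -1 per example.
    Returns (weights dict, bias).
    """
    w, b, n = {}, 0.0, len(examples)
    for _ in range(epochs):
        grad_w, grad_b = {}, 0.0
        for feats, y in zip(examples, labels):
            score = sum(w.get(f, 0.0) for f in feats) + b
            if y * score < 1.0:                  # margin violated
                for f in feats:
                    grad_w[f] = grad_w.get(f, 0.0) - y / n
                grad_b -= y / n
        for f in w:                              # L2 regularization shrinkage
            w[f] -= lr * lam * w[f]
        for f, g in grad_w.items():              # hinge-loss sub-gradient step
            w[f] = w.get(f, 0.0) - lr * g
        b -= lr * grad_b
    return w, b


def train_one_vs_rest(tokens, tags):
    """Train one binary machine per class: that class (+1) vs. the rest (-1)."""
    examples = [features(t) for t in tokens]
    models = {}
    for cls in sorted(set(tags)):
        labels = [1 if t == cls else -1 for t in tags]
        models[cls] = train_binary_svm(examples, labels)
    return models


def predict(models, token):
    """Pick the class whose binary machine scores the token highest."""
    feats = features(token)

    def score(cls):
        w, b = models[cls]
        return sum(w.get(f, 0.0) for f in feats) + b

    return max(models, key=score)


# Toy training data (hypothetical): words with their entity classes.
tokens = ["kinase", "protease", "monocytes", "lymphocytes", "the", "of"]
tags = ["PROTEIN", "PROTEIN", "CELL_TYPE", "CELL_TYPE", "O", "O"]
models = train_one_vs_rest(tokens, tags)
print(predict(models, "lipase"), predict(models, "erythrocytes"))
```

Note that this scheme trains every binary machine on the full training set, so its cost grows with the number of classes. That is one reason training time becomes a concern for multi-class SVMs on large inputs.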

Future Work