Responsible for this page: He Tan, hetan@ida.liu.se
Page last updated: 2006-12-08

Biomedical Text Mining

The volume of published biomedical research, and therefore the underlying biomedical knowledge base, is expanding at an increasing rate. Pharmaceutical industry estimates suggest that 90% of drug targets are derived from the biomedical literature, and that at least 50% of the facts essential for validating drug targets have already been reported there. The MEDLINE database contains over 17 million scientific abstracts and grows by about 2,000 to 4,000 articles per day. With such explosive growth, it is extremely challenging to keep up to date with all of the new discoveries and theories in the literature. Moreover, the literature is mostly available only as free text, so it is much more difficult to obtain information and knowledge from it than from structured databases. As a result, the last few years have seen a surge of interest in text mining of the biomedical literature.

Gene Symbol Disambiguation (GSD)

When mining biomedical literature, a major challenge is gene symbol ambiguity. There is no community-wide agreement on how a particular gene or gene product should be named. A gene symbol 1) may refer to a particular gene; 2) may include homologues of this gene in other organisms; 3) may denote an RNA or encompass the protein the gene encodes; or 4) may be restricted to a specific splice variant.

The task of GSD is to determine the unique identifiers of genes and proteins mentioned in scientific literature.
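
To make the task concrete, the following Python sketch models a toy candidate table in which one symbol maps to several gene identifiers. The symbols, identifiers, organisms and the `CANDIDATES` table are hypothetical placeholders for illustration, not records from any real gene database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    gene_id: str      # hypothetical identifier, not from a real database
    organism: str
    description: str


# Hypothetical toy table: symbol -> candidate genes that share this symbol.
CANDIDATES = {
    "SYM1": [
        Candidate("TOY:1", "human", "membrane transporter"),
        Candidate("TOY:2", "mouse", "membrane transporter homologue"),
        Candidate("TOY:3", "yeast", "mitochondrial kinase"),
    ],
    "SYM2": [Candidate("TOY:4", "human", "transcription factor")],
}


def candidates_for(symbol):
    """Return all candidate genes recorded for a symbol (case-insensitive)."""
    return CANDIDATES.get(symbol.upper(), [])


def is_ambiguous(symbol):
    """A symbol is ambiguous when it has more than one candidate gene."""
    return len(candidates_for(symbol)) > 1
```

In this toy table, `SYM1` is ambiguous (three candidates across organisms), while `SYM2` is not. GSD amounts to choosing one `gene_id` for each mention of an ambiguous symbol.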

Current work (CENIIT project 08.08)

In [1] we propose a method for gene symbol disambiguation which relies on information about gene candidates, contexts of gene symbols and external knowledge sources. We extract information about gene candidates from gene databases. Biomedical ontologies are used to determine the context of gene symbols relevant to gene candidate information. Disambiguation is based on matching contexts of the symbol to information about gene candidates.
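
The matching step can be illustrated with a minimal Python sketch. It is not the method of [1]: instead of ontology-derived contexts, it uses plain word overlap between the sentence containing the symbol and a short textual description of each candidate. The candidate identifiers, descriptions, stopword list and function names are all assumptions made for the example.

```python
import re

# Hypothetical candidate descriptions; a real system would draw these
# from gene databases rather than hard-coding them.
CANDIDATE_INFO = {
    "TOY:1": "human membrane transporter involved in lipid transport",
    "TOY:2": "yeast mitochondrial kinase involved in respiration",
}

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "involved"}


def tokenize(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def disambiguate(context, candidate_info):
    """Score each candidate by word overlap with the context.

    Returns (best_id, scores). best_id is None when no candidate
    overlaps with the context or when the top score is tied.
    """
    context_terms = tokenize(context)
    scores = {
        gene_id: len(context_terms & tokenize(description))
        for gene_id, description in candidate_info.items()
    }
    best_id = max(scores, key=scores.get)
    top = scores[best_id]
    if top == 0 or list(scores.values()).count(top) > 1:
        return None, scores
    return best_id, scores


best, scores = disambiguate(
    "SYM1 regulates lipid transport across the membrane in human cells.",
    CANDIDATE_INFO,
)
```

Here `best` is `"TOY:1"`, because four context words (human, membrane, lipid, transport) match that candidate's description and none match the other. Returning `None` on a tie or zero overlap keeps the sketch from guessing when the context gives no evidence.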

The available degree projects related to GSD are listed here.

Reference

[1] Tan H (2008). 'Knowledge-based Gene Symbol Disambiguation'. Second International Workshop on Data and Text Mining in Bioinformatics, Napa Valley, California, USA.