Biomedical Text Mining
The volume of published biomedical research, and therefore the
underlying biomedical knowledge base, is expanding at an
increasing rate. Pharmaceutical industry estimates suggest
that 90% of drug targets are derived from the biomedical
literature. At least 50% of the facts essential for
validation of drug targets have already been reported
somewhere in this large body of literature. The MEDLINE
database contains over 17 million scientific abstracts
and grows by about 2,000 to 4,000 articles per
day. With such rapid growth, it is extremely
challenging to keep up to date with all of the new
discoveries and theories in the literature. The
literature is mostly available only in the form of free
text. Therefore, it is much more difficult to obtain
information and knowledge from the literature than from
structured databases. During the last few years, there has
been a surge of interest in text mining of the biomedical
literature.
Gene Symbol Disambiguation (GSD)
When mining biomedical literature, a major challenge is
gene symbol ambiguity. There is no community-wide
agreement on how a particular gene and its gene product should
be named. A gene symbol may: 1) refer to a particular
gene; 2) include homologues of this gene in other
organisms; 3) denote an RNA or the protein
the gene encodes; or 4) be restricted to a specific
splice variant.
The task of GSD is to determine the unique
identifiers of genes and proteins mentioned in scientific
literature.
In [1] we propose a method for gene symbol disambiguation which relies
on information about gene candidates, contexts of
gene symbols and external knowledge sources. We
extract information about gene candidates from
gene databases. Biomedical ontologies are used to
determine the context of gene symbols relevant to
gene candidate information. Disambiguation is
based on matching contexts of the symbol to
information about gene candidates.
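The matching step can be illustrated with a minimal sketch. This is not the method from [1]: it swaps in a plain word-overlap (Jaccard) score for the ontology-based matching, and the candidate identifiers, descriptions, stopword list and function names are all hypothetical. A real system would draw candidate information from gene databases and map context terms through biomedical ontologies.

```python
import re

# Hypothetical candidate records for one ambiguous symbol.
# In practice these descriptions would come from a gene database.
CANDIDATES = {
    "GENE:0001": "tumor suppressor protein involved in cell cycle arrest and apoptosis",
    "GENE:0002": "membrane transporter involved in glucose uptake in muscle tissue",
}

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"in", "and", "the", "of", "a", "is", "to", "by"}


def terms(text):
    """Lowercase the text and return its set of non-stopword word tokens."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def disambiguate(context, candidates):
    """Score each candidate by Jaccard overlap with the context.

    Returns (best_id, scores); best_id is None when no candidate
    shares any term with the context.
    """
    ctx = terms(context)
    scores = {}
    for gene_id, description in candidates.items():
        desc = terms(description)
        union = ctx | desc
        scores[gene_id] = len(ctx & desc) / len(union) if union else 0.0
    best = max(scores, key=scores.get) if scores else None
    if best is not None and scores[best] == 0.0:
        best = None
    return best, scores


context = "The symbol ABC was linked to apoptosis and cell cycle arrest in tumor cells."
best, scores = disambiguate(context, CANDIDATES)
print(best, scores)
```

Here the context shares several terms with the first candidate and none with the second, so the first identifier is selected. Replacing exact word overlap with ontology concepts would let related terms match even when the wording differs.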
The available
degree projects related to GSD are listed
here.
Reference
[1] Tan H. (2008). 'Knowledge-based Gene Symbol Disambiguation'. Second International Workshop on Data and Text Mining in Bioinformatics, Napa Valley, California, USA.