Ontologies and Text Mining for Life Science
Project Leader: He Tan
The project is funded by the Center for Industrial Information Technology (CENIIT) and Stiftelsen Olle Engkvist Byggmästare.
Background and industry motivation
Life science researchers report their findings in scientific publications. At the time of writing, the PubMed/MEDLINE database contains over 20 million scientific abstracts and grows by about 2,000 to 4,000 biomedical articles per day. With such explosive growth, it is challenging to keep track of the knowledge needed for research and education that still resides in free text. Biomedical text mining (BioTM) is an emerging field that aims to address this issue. It covers tasks such as the identification of gene regulatory events and protein-protein interactions, the functional annotation of proteins, and the identification and prioritization of disease-related genes.
Text mining is becoming particularly important for the pharmaceutical industry. Finding the most effective drug to improve the lives of people suffering from disease is a long and costly research process: it may take around 15 years and cost around 1.4 billion dollars. Making the right decisions along the way requires a large amount of relevant information, e.g. for identifying biological targets that have the potential to be starting points for successful and commercially viable treatments. Most of the relevant information is probably still locked in textual documents, both internal and external.
Ontologies for Biomedical Text Mining
Ontologies are conceptual models that aim to support consistent and unambiguous knowledge sharing and that provide a framework for knowledge integration. An ontology links concept labels to their interpretations, i.e. specifications of their meanings, including concept definitions and relations to other concepts. Ontologies reflect the structure of the domain knowledge and constrain the potential interpretations of terms. They have attracted attention as a framework for the semantic representation of textual information, and thus as a basis for text mining systems. Until recently, TM systems have mainly used ontologies either as terminologies for recognizing biomedical terms, by mapping terms occurring in text to concepts in ontologies, or to guide and constrain the analysis of NLP results, by populating ontologies. In the latter case, ontologies are used more actively as a structured and semantic representation of domain knowledge.
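As a rough illustration of the terminology-style use of ontologies, the sketch below maps terms occurring in a sentence to concepts by simple dictionary lookup. The toy ontology, its "EX:" identifiers, and the function names are made up for illustration; they are not taken from any real ontology or from the project's software, and real systems handle spelling variants and ambiguity far more carefully.

```python
import re

# Hypothetical toy ontology: concept id -> label, synonyms, and parent concepts.
# The "EX:" identifiers are invented for this example.
TOY_ONTOLOGY = {
    "EX:0001": {"label": "transport",
                "synonyms": ["translocation"],
                "is_a": []},
    "EX:0002": {"label": "protein transport",
                "synonyms": ["protein translocation"],
                "is_a": ["EX:0001"]},
}


def build_lexicon(ontology):
    """Map every lower-cased label and synonym to its concept id."""
    lexicon = {}
    for concept_id, concept in ontology.items():
        for term in [concept["label"], *concept["synonyms"]]:
            lexicon[term.lower()] = concept_id
    return lexicon


def map_terms(text, ontology):
    """Return (start, end, matched_text, concept_id) for each term found."""
    lexicon = build_lexicon(ontology)
    # Try longer terms first so "protein transport" wins over "transport".
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), lexicon[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


sentence = "Protein transport across the membrane requires translocation machinery."
for start, end, surface, concept_id in map_terms(sentence, TOY_ONTOLOGY):
    print(f"{surface!r} [{start}:{end}] -> {concept_id}")
```

The lookup only recognizes terms. The "is_a" links in the toy ontology hint at the second use described above, where the relations between concepts help to constrain how a recognized term is interpreted.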
Project Goal and Vision
The long-term aim of the project is to provide TM strategies for life science that are supported by ontological domain knowledge. Currently, we aim to develop methodologies and a supporting environment in which the construction of a corpus annotated with semantic roles is efficiently guided and eased by the domain knowledge that ontologies provide. Corpus development is very expensive and time-consuming. Further, the methods for developing domain-specific corpora annotated with semantic roles have been rather informal and intuitive. We believe that ontologies, as a structured and semantic representation of domain-specific knowledge, can guide and ease these tasks. The project will produce a large frame-based corpus supporting the sentence-level semantic analysis of biomedical texts. In doing so, the project is expected to provide substantive answers to two research questions: what types of ontologies are needed for biomedical text mining, and how ontologies can best support it.
Project Result and Status
A frame-based corpus is available here
The frame is built entirely from the domain knowledge provided by the part of the Gene Ontology (GO) that describes the event "transport". The core structure of the frame is the same as that of FrameNet.
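To make the idea of a FrameNet-style frame concrete, the following is a minimal sketch of how a frame and one annotated sentence might be represented. The class names, the frame elements (Entity, Origin, Destination), the lexical units, and the example sentence are assumptions made for illustration only; they are not the frame elements or data format of the project's corpus.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FrameElement:
    """A semantic role in a frame (hypothetical names in this example)."""
    name: str
    core: bool = True


@dataclass
class Frame:
    """FrameNet-style frame: definition, frame elements, lexical units."""
    name: str
    definition: str
    elements: List[FrameElement] = field(default_factory=list)
    lexical_units: List[str] = field(default_factory=list)


@dataclass
class Annotation:
    """One sentence annotated against a frame, using character spans."""
    frame: Frame
    sentence: str
    roles: Dict[str, Tuple[int, int]] = field(default_factory=dict)

    def add_role(self, element_name: str, span_text: str) -> None:
        # Only roles declared by the frame may be used.
        if element_name not in {fe.name for fe in self.frame.elements}:
            raise ValueError(f"unknown frame element: {element_name}")
        start = self.sentence.index(span_text)  # first occurrence only
        self.roles[element_name] = (start, start + len(span_text))


# Illustrative frame; the elements and lexical units are assumptions.
transport = Frame(
    name="Transport",
    definition="An entity is moved from an origin to a destination.",
    elements=[FrameElement("Entity"), FrameElement("Origin"),
              FrameElement("Destination")],
    lexical_units=["transport.v", "export.v"],
)

ann = Annotation(transport, "The protein is exported from the nucleus to the cytoplasm.")
ann.add_role("Entity", "The protein")
ann.add_role("Origin", "the nucleus")
ann.add_role("Destination", "the cytoplasm")

for role, (start, end) in ann.roles.items():
    print(role, "->", ann.sentence[start:end])
```

Restricting annotations to the roles declared by the frame is one simple way in which domain knowledge can constrain the annotation task.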
Publications
- Tan H, Kaliyaperumal R, Benis N, Ontology-driven Construction of Corpus with Frame Semantics Annotations, submitted to BMC Bioinformatics, July 2011.
- Tan H, Kaliyaperumal R, Benis N, Building frame-based corpus on the basis of ontological domain knowledge, Proceedings of the 2011 Workshop on Biomedical Natural Language Processing, ACL-HLT 2011, 74-82, Portland, Oregon, USA.
- Tan H, A study on the relation between linguistics-oriented and domain-specific semantics, Proceedings of the 3rd International Workshop on Semantic Web Applications and Tools for the Life Sciences, Berlin, Germany, December 8-10, 2010.
- Tan H, Lambrix P, Selecting an Ontology for Biomedical Text Mining, Proceedings of the Workshop on BioNLP, 55-62, Boulder, Colorado, USA, 2009.
- Tan H, Lambrix P, Selecting an Ontology for Biomedical Text Mining, Poster at the 17th Conference on Intelligent Systems for Molecular Biology, ISMB-2009, Stockholm, Sweden, 2009.
- Tan H, Knowledge-based Gene Symbol Disambiguation, CIKM: Proceedings of the 2nd International Workshop on Data and Text Mining in Bioinformatics, 73-76, Napa Valley, California, USA, 2008.