BioOntoFN

Ontology-driven Construction of Corpus with Frame Semantics Annotations for Biomedicine Domain


Background

The sentence-level semantic analysis of text is concerned with the characterization of events, such as determining "who" did "what" to "whom", "where", "when" and "how". It plays a key role in text mining (TM) applications such as Information Extraction (IE), Question Answering and Document Summarization. The predicate of a clause expresses "what" took place, and other sentence constituents express the participants in the event (such as "who" and "where"). Semantic Role Labeling (SRL) is a process that, for each predicate in a sentence, indicates what semantic relations hold among the predicate and its associated sentence constituents. The relations are described by using a list of pre-defined possible semantic roles for that predicate (or class of predicates).
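As a rough illustration of what SRL produces for one predicate, the sketch below stores a labeled predicate as a small Python structure. The sentence, the role names (Agent, Entity, Destination) and the class name LabeledPredicate are invented for illustration; they are not taken from FrameNet, PropBank or the corpus described on this page.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class LabeledPredicate:
    """One predicate and the sentence constituents that fill its roles."""
    predicate: str                                         # "what" took place
    roles: Dict[str, str] = field(default_factory=dict)    # role name -> constituent

    def describe(self) -> str:
        parts = [f"{role}={text!r}" for role, text in self.roles.items()]
        return f"{self.predicate}(" + ", ".join(parts) + ")"


# Invented example sentence, not drawn from the corpus.
sentence = "Importin carries the cargo protein into the nucleus."
labeled = LabeledPredicate(
    predicate="carries",
    roles={
        "Agent": "Importin",                # "who"
        "Entity": "the cargo protein",      # "whom"
        "Destination": "into the nucleus",  # "where"
    },
)

# Every labeled constituent must be a span of the sentence.
assert all(text in sentence for text in labeled.roles.values())
print(labeled.describe())
```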

Recently, large corpora have been manually annotated with semantic roles in FrameNet and PropBank. With the advent of these resources, SRL has become a well-defined task with a substantial body of work and comparative evaluation. As with other technologies in natural language processing (NLP), researchers have experienced difficulties in adapting SRL systems to a new domain that differs from the domain used to develop and train the system. Biomedical text differs considerably from FrameNet and PropBank data, both in the style of the written text and in the predicates involved. The development of SRL systems for the biomedical domain has been held back by the lack of large domain-specific corpora labeled with semantic roles.

Method

Building a domain corpus annotated with semantic roles raises several difficulties: how to discover and define semantic frames, together with their associated semantic roles, within the domain; how to collect domain-specific predicates and group them under each semantic frame; and how to select example sentences from publication databases, such as the PubMed/MEDLINE database, which contains over 20 million articles. In this project we propose a method for building corpora that are labeled with semantic roles for the biomedical domain. The method is based on the theory of frame semantics and relies on domain knowledge provided by ontologies. We believe that ontologies, as a structured and semantic representation of domain knowledge, can guide and ease all the tasks in building this kind of corpus.

The FrameNet project is the application of the theory of Frame Semantics [1] to computational lexicography. Frame semantics begins with the assumption that, in order to understand the meanings of the words in a language, we must first have knowledge of the background and motivation for their existence in the language and for their use in discourse. This knowledge is provided by conceptual structures, or semantic frames. An ontology is a formal representation of the knowledge of a domain of interest. Ontologies reflect the structure of the domain knowledge and constrain the potential interpretations of terms. Intuitively, ontological concepts, relations, rules and their associated textual definitions can be used as the frame-semantic descriptions imposed on a corpus. We propose a method of ontology-driven corpus construction based on the theory of Frame Semantics [2]. Here we outline the aspects of ontology-driven frame-semantic descriptions:

  • The structure and semantics of domain knowledge in ontologies constrain the frame semantics analysis, i.e. they decide the coverage of semantic frames and the relations between them;
  • Ontological terms can comprehensively describe the characteristics of events/scenarios in the domain, so domain-specific semantic roles can be determined based on the terms;
  • Ontological terms provide domain-specific predicates, so the semantic senses of the predicates in the domain are determined;
  • The collection and selection of example sentences can be based on a knowledge-based search engine for biomedical text.
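The first and third aspects above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure used to build the corpus: the is_a hierarchy is a simplified, illustrative fragment, the seed list PREDICATE_WORDS is hypothetical, and the names Frame, subtree and build_frame are introduced here only for the example. The frame's coverage is taken to be the ontology subtree under a chosen term, its parent frame comes from the is_a relation, and candidate predicates are read off the term names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Illustrative, simplified is_a hierarchy: term name -> parent term name.
IS_A: Dict[str, Optional[str]] = {
    "transport": None,
    "protein transport": "transport",
    "protein import into nucleus": "protein transport",
    "protein export from nucleus": "protein transport",
    "protein secretion": "protein transport",
}

# Hypothetical seed list of words that can act as predicates.
PREDICATE_WORDS = {"transport", "import", "export", "secretion"}


@dataclass
class Frame:
    name: str
    inherits: Optional[str] = None                 # parent frame, from is_a
    source_terms: List[str] = field(default_factory=list)
    lexical_units: List[str] = field(default_factory=list)


def subtree(root: str) -> List[str]:
    """Return the root term and all terms below it in the is_a hierarchy."""
    found = [root]
    for term, parent in IS_A.items():
        if parent == root:
            found.extend(subtree(term))
    return found


def build_frame(root: str) -> Frame:
    """Derive a frame whose coverage is the ontology subtree under root."""
    terms = subtree(root)
    # Candidate predicates are the seed words that occur in the term names.
    units = sorted({w for t in terms for w in t.split() if w in PREDICATE_WORDS})
    return Frame(name=root, inherits=IS_A[root],
                 source_terms=terms, lexical_units=units)


frame = build_frame("protein transport")
print(frame.inherits)
print(frame.lexical_units)
```

In practice the candidate predicates would still need manual review, and verb forms such as "secrete" would have to be related to nominal terms such as "secretion"; the sketch leaves both steps out.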

Using this method, we have built a corpus for transport events that strictly follows the piece of domain knowledge provided by the GO biological process ontology. We compared one of the frames in the corpus, Protein Transport, to the frame Protein_transport in BioFrameNet. We also examined the gaps between the semantic classification of the target words in this domain-specific corpus and in FrameNet and PropBank/VerbNet data. The successful corpus construction demonstrates that the method can ease the tasks in building a domain corpus with frame semantics annotations. Furthermore, ontological domain knowledge leads to well-defined semantics exposed in the corpus, which will be very valuable in text mining applications.


Corpus Data

Frame Protein Transport

The construction of the frame (definition, description and the collection of data) strictly follows the domain knowledge provided by the part of GO describing the event "protein transport". The core structure of the frame is the same as that used in FrameNet. The annotations follow FrameNet's guidelines for lexicographic annotation, described in (Ruppenhofer et al., 2005).

Protein Transport inherits from the frame Transport (the data will be published soon! [3]).
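To make the inheritance relation concrete, the sketch below resolves the frame elements visible in a child frame by walking up to its parent. It is a simplification under stated assumptions: the frame element names are placeholders rather than the elements defined in the corpus, and FrameNet's actual inheritance binds each parent element to a corresponding child element instead of merely copying it.

```python
from typing import List

# Hypothetical frame definitions. The frame element names are placeholders,
# not the frame elements defined in the corpus.
FRAMES = {
    "Transport": {
        "inherits": None,
        "elements": ["Transported_entity", "Origin", "Destination"],
    },
    "Protein Transport": {
        "inherits": "Transport",
        "elements": ["Transporter"],
    },
}


def all_elements(frame_name: str) -> List[str]:
    """Return a frame's own elements plus those of every ancestor frame."""
    elements: List[str] = []
    current = frame_name
    while current is not None:
        frame = FRAMES[current]
        # Prepend so that elements of more general frames come first.
        inherited = [e for e in frame["elements"] if e not in elements]
        elements = inherited + elements
        current = frame["inherits"]
    return elements


print(all_elements("Protein Transport"))
```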

Credits

The project is led by He Tan. In addition to the project leader, the following people have been involved in the corpus construction:
  • Rajaram Kaliyaperumal
  • Nirupama Benis
The work is funded by the Center for Industrial Information Technology (CENIIT) and Stiftelsen Olle Engkvist Byggmästare.

References

[1] Charles J. Fillmore. 1985. Frames and the semantics of understanding. Quaderni di Semantica, 6(2).

Josef Ruppenhofer, Michael Ellsworth, Miriam R. L. Petruck, Christopher R. Johnson and Jan Scheffczyk. 2005. FrameNet II: Extended Theory and Practice. ICSI.

[2] Tan H, Kaliyaperumal R, Benis N. 2011. Building frame-based corpus on the basis of ontological domain knowledge. Proceedings of the 2011 Workshop on BioNLP, 74-82, Portland, Oregon, USA.

[3] Tan H, Kaliyaperumal R, Benis N. Ontology-driven Construction of Corpus with Frame Semantics Annotations. Accepted to the 13th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2012).

[4] Tan H, Kaliyaperumal R, Benis N. Ontology-driven Construction of Corpus with Frame Semantics Annotations. Poster at the Fourth International Symposium on Languages in Biology and Medicine (LBM 2011), Singapore.




Responsible for this page: He Tan
Last updated: 2011-05-06