Researchers in bioinformatics often have to retrieve data from multiple biological data sources (DSs) to solve their research problems. Many such DSs are publicly available on the Web. As most DSs are developed and maintained independently, they are highly heterogeneous. They vary in the type of data stored, the data format, and the access methods. In addition, terminology differs at both the data level and the schema level, which further complicates the data retrieval process. The user must decide which DSs to access and in which order, how to retrieve the data, and how to combine the results. In short, retrieving data requires a great deal of effort and expertise on the part of the user. Users also have to take into account that bioinformatics is a dynamic field in which DS schemas change and new DSs are developed. Information integration systems (IISs) aim to alleviate these problems by providing a uniform (or even integrated) interface to the underlying DSs. IISs may need to find the DSs that are relevant to a user query, divide the query into smaller subqueries, and combine the retrieved results.
In this project we propose an IIS for biological DSs. We have proposed a base query language containing the operators that should be present in any query language for biological DSs. The main features of the query language include an object model, queries about types and values, paths and path variables, and specialized functions that provide hooks to alignment and string search programs. Further, we developed an architecture for a system that supports such a language and provides integrated access to biological DSs. The proposed architecture contains a mediator consisting of a query interpreter and expander, a retrieval engine that generates query plans consisting of subqueries to the different DSs, and an answer filter and assembler. The architecture also assumes the existence of an ontology base, a data source knowledge base with information about the contents and capabilities of the DSs, and wrappers that encapsulate the DSs. As a feasibility study, a small prototype was implemented. The current prototype generates a space of possible alternative plans in the presence of alternative DSs, DSs mirrored at different sites, alternative integration conditions between the DSs, and alternative ways to order data retrieval from the DSs. The prototype supports answering queries expressed over an integrated schema by rewriting them into queries over the DSs. It integrates the results retrieved from the DSs and compensates for the limited query capabilities of the DSs.
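To illustrate how such a plan space can grow, the following is a minimal sketch in Python, not the prototype's actual implementation. It assumes that each subquery can be answered by one of several alternative DSs, that each DS may be mirrored at several sites, and that the subqueries may be issued in any order. The names Source and enumerate_plans, as well as the DS and site labels, are hypothetical. Alternative integration conditions are left out for brevity.

```python
from dataclasses import dataclass
from itertools import permutations, product


@dataclass(frozen=True)
class Source:
    """Hypothetical description of a data source and its mirror sites."""
    name: str
    mirrors: tuple


def enumerate_plans(alternatives):
    """Yield every plan for a set of subqueries.

    alternatives maps a subquery label to the list of Sources able to
    answer it. A plan is a tuple of (subquery, source name, mirror)
    steps; each choice of source and mirror is emitted once per
    possible retrieval order.
    """
    labels = sorted(alternatives)
    # For each subquery, list every (source, mirror) pair that can serve it.
    choices = [
        [(label, src.name, mirror)
         for src in alternatives[label]
         for mirror in src.mirrors]
        for label in labels
    ]
    # Pick one pair per subquery, then consider every retrieval order.
    for combo in product(*choices):
        for ordering in permutations(combo):
            yield ordering


# Illustrative input: one mirrored DS for proteins, two alternative DSs for pathways.
example = {
    "protein": [Source("DS_A", ("site1", "site2"))],
    "pathway": [Source("DS_B", ("site3",)), Source("DS_C", ("site4",))],
}
plans = list(enumerate_plans(example))
```

Even this small input yields several plans, because the number of plans is the product of the choices per subquery multiplied by the number of retrieval orders. A retrieval engine would typically prune or rank such a space rather than enumerate it exhaustively.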
We also argue that the ontological knowledge available on the Web should be used for the integration of biological data. We have identified different types of ontological knowledge that are publicly available on the Web in the field of bioinformatics and have shown how this knowledge can be used to support integrated access to multiple biological DSs.