CENIIT Project 06.05:
Efficient analysis and management of XML data


Background and industrial motivation

The Extensible Markup Language (XML) has become increasingly popular for representing information during the past few years. The main reason for its popularity is its flexibility. XML can be used for applications and situations ranging from providing limited structure to documents to providing a database-like structure on larger datasets. It can be used for data storage as well as data exchange. Another reason for the popularity of XML is its simple structure, which makes it accessible and understandable also to people with a limited computer science background. A problem, however, is that when XML is used in real-life applications, the specific XML structure is often underspecified or unknown, as described by the following scenarios.

  1. A cell biologist is working on how proteins react with each other within living organisms. He has a computer tool for storing and exploring information about his previous experiments. However, as experimenting is expensive, he wants to plan future experiments carefully by comparing his results with results from other research teams. Such results are available on the web as XML files, and he wants an easy way to compare them with his own results. The problem is that even though XML standards are available in the area, many competing formats exist, and the researcher would need to implement time-consuming translations between these formats to analyze the data and compare it with his own.

  2. One important feature when designing set-top boxes for digital television is to give the user means to explore information about upcoming TV programs. The standard TV-Anytime (www.tv-anytime.org) proposes to broadcast information about the schedule and content of upcoming TV shows as XML data. However, the broadcasters have only agreed on a standardized XML format for a limited number of features. For many important features they are allowed to define the exact XML format themselves. This means that the end user, who cannot be required to know anything about the XML-based representation of information, will only be able to search efficiently on a small part of the features actually sent to him, as he is dependent on the variants of the standards actually implemented in his set-top box.

In both examples the problem is that even though the domain and data are well defined, variations in the XML formats make the user unable to reach his goals without time-consuming and costly programming. Both examples relate to our experience (Strömbäck 2004, Strömbäck and Lambrix 2005a, b, Turcan, Strömbäck and Morris 2003) with standards and standardization work in the exemplified areas. In reality, standardization work is often time-consuming and full of compromises, with the result that real life is full of competing and underspecified standards, often with several versions of a standard under development.

The aim of this project is to develop technology for analysis, management and storage of XML data that, by itself or guided by the user, takes variations between formats into account. This question is highly relevant to current database research, and the results on XML management offer an opportunity for large impact in the research community.

Long term vision

The goal of this project is to develop technology that supports a user with efficient access to and management of XML data across several competing formats or standards, where the end user has clear knowledge about the semantics of the structure he is interested in but less knowledge about the structure of the stored data. The approach is to use the user's knowledge about the domain to guide the analysis of data and provide him with tools for further processing and storage of the dataset. To reach this goal we have to address the following three research questions:
  1. How can we provide the user with a quick overview of the content of an XML file without requiring knowledge about the exact structure of the file?
    Our previous studies of variations in XML data across various standards (Strömbäck 2004, Strömbäck & Lambrix 2005a, b) have shown that understanding the content of a dataset is crucial for processing it efficiently.
  2. How can we provide the user with more in-depth information about some particular part of an XML file without requiring the user to know the exact structure of the file?
    XML data can today be queried with the query language XQuery (www.w3c.org). Our previous evaluation (Strömbäck 2005) has shown that XQuery is a very powerful query language. It does, however, require that the user is familiar with the XML structure of a particular dataset, which is not the case in the applications of this project.
  3. How can XML files be efficiently stored and accessed?
    Efficient storage and indexing are today important research topics, and several evaluations exist (e.g. Strömbäck 2005). Here we will study how the results from the semantic analysis, achieved in research question 1, can be used for automatic translation to databases and for indexing of stored datasets.
During the first year we will address all three research questions by focusing on structural analysis and semi-automatic approaches, which form the basis for development towards automatic approaches for the analysis and management of XML data. In a three-year perspective the goal is to move towards more automatic approaches that take into account the semantics and limitations of each domain. The long-term goal of the project is to extend this work further by using the experience from built prototypes for evaluations in several domains, working more and more towards automated and semantic approaches.
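
As an illustration of the kind of structural analysis meant in research questions 1 and 2, the following minimal Python sketch summarizes which element paths occur in an XML document and looks up elements by name regardless of where they are nested. It is not taken from the project's prototypes. The function names (local_name, path_summary, find_anywhere) and the sample data are hypothetical and chosen only for illustration; the only library used is Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip a '{namespace}' prefix so that tags differing only in namespace compare equal."""
    return tag.rsplit("}", 1)[-1]


def path_summary(root):
    """Count how often each root-to-element path occurs in the document."""
    counts = Counter()

    def walk(elem, prefix):
        path = prefix + "/" + local_name(elem.tag)
        counts[path] += 1
        for child in elem:
            walk(child, path)

    walk(root, "")
    return counts


def find_anywhere(root, name):
    """Return the text of every element with the given local name, at any depth."""
    return [
        (elem.text or "").strip()
        for elem in root.iter()
        if local_name(elem.tag) == name
    ]


# Hypothetical sample data: the same concept ("protein") appears at two
# different positions, mimicking two competing formats merged in one file.
SAMPLE = """
<experiments>
  <interaction>
    <protein>P53</protein>
    <protein>MDM2</protein>
  </interaction>
  <interaction>
    <participants>
      <protein>BRCA1</protein>
    </participants>
  </interaction>
</experiments>
"""

root = ET.fromstring(SAMPLE)

# Overview: which structures exist, and how often (research question 1).
summary = path_summary(root)
for path, count in sorted(summary.items()):
    print(count, path)

# Lookup by element name without knowing the exact structure (research question 2).
proteins = find_anywhere(root, "protein")
print(proteins)
```

The path summary gives an overview of a file without any prior schema knowledge, and the name-based lookup finds the same concept at different nesting levels. Neither step handles semantic variation, such as two formats using different element names for the same concept, which is the harder problem addressed by the project.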

Research environment

The project will be situated at the Department of Computer and Information Science (IDA), in the Laboratory for Intelligent Information Systems (IISLAB). IISLAB is headed by Professor Nahid Shahmehri and has expertise within the following areas: database technology, database design for Internet and mobile applications, information security and integrity, peer-to-peer networks, and technology for the elderly. In this environment the project will be strongly connected to the team responsible for graduate and undergraduate education within database technology. There are also connections to the application areas bioinformatics, through Patrick Lambrix's work, and digital television, through the work by Lena Strömbäck in the European project Share it!

Publications

Publications resulting from the project:
  1. Strömbäck L, Ivanova V, Hall D (2011) 'Exploring Statistical Information for Applications-Specific Design and Evaluation of Hybrid XML storage'. Databases, Knowledge, and Data Applications DBKDA 2011, published by IEEE, January 2011, St Maarten, The Netherlands.
  2. Strömbäck L, Freire J (2011) 'XML Management for Bioinformatics Applications'. Computing in Science and Engineering (CiSE). To appear.

  3. Åsberg M, Strömbäck L (2011) 'Bioinformatics: From Disparate Web Services to Semantics and Interoperability.' International Journal of Advances in Software. To appear. Invited contribution.

  4. Laux F, Strömbäck L, (Eds) (2010) Proceedings of the Second International Conference on Advances in Databases, Knowledge, and Data Applications DBKDA 2010, published by IEEE, April 2010, Les Menuires, France.

  5. Åsberg M, Strömbäck L (2010) 'Interoperable and Easy-to-Use Web Services for the Bioinformatics Community - A Case Study.' The Second International Conference on Advances in Databases, Knowledge, and Data Applications, DBKDA 2010, April 2010, Les Menuires, France. Best Paper Award.

  6. Hall D, Strömbäck L (2010) 'Generation of Synthetic XML for Evaluation of Hybrid XML Systems.' In: Yoshikawa M, Meng X, Yumoto T, et al. (eds) Database Systems for Advanced Applications 15th International Conference, DASFAA 2010, International Workshops: GDM, BenchmarX, MCIS, SNSMW, DIEW, UDM, Tsukuba, Japan, April 1-4, 2010, Revised Selected Papers. Lecture Notes in Computer Science, Volume 6193, 2010, DOI: 10.1007/978-3-642-14589-6.

  7. Lambrix P, Strömbäck L, Tan H (2009) 'Information integration in bioinformatics with ontologies and standards', chapter 8 in Bry, Maluszynski, (eds), Semantic Techniques for the Web: The REWERSE perspective, Springer.

  8. Strömbäck, L, Schmidt S (2009) An Extension of XQuery for Graph Analysis of Biological, Proc. The First International Conference on Advances in Databases, Knowledge, and Data Applications, DBKDA 2009, March 1-6, 2009, Cancun, Mexico

  9. Ellkvist, T, Strömbäck, L, Didier Lins, L, and Freire J (2009) A First Study on Strategies for Generating Workflow Snippets. The First International Workshop on Keyword Search on Structured Data, KEYS 2009, co-located with ACM SIGMOD/PODS, June 28th, 2009, Providence, Rhode Island, USA

  10. Ellkvist, T, Koop, D, Freire, J, Silva, C T, and Strömbäck, L (2009) Using Mediation to Achieve Provenance Interoperability, In IEEE International Workshop on Scientific Workflows, 2009.

  11. Strömbäck, L, Åsberg, M, Hall, D (2009) HShreX: a Tool for Design and Evaluation of Hybrid XML Storage, FlexDBIST 2009, Linz, Austria.
  12. Köhn D, Strömbäck L (2008) 'A method for Semi-automatic Standard Integration in Systems Biology', Proc. 19th Intern. Conf. on Database and Expert Systems Applications, pp 745 - 752

  13. Lambrix P, Strömbäck L (2008) 'Where is my protein? - Issues in Information Integration', BIOforum Europe, volume 12, pp 24 - 25. Republished as one of the highlights of 2007. journal page.

  14. Strömbäck, L, Eifrém E, Faraglia P (2008) 'Customizable XML Management on a Navigational Database Framework', Proc. 19th International Workshop on Database and Expert Systems Applications, Turin, Italy, pp 261 - 265

  15. Ellkvist T, Freire J, Koop D, Silva C, Strömbäck L (2008) Using Mediation to Achieve Provenance Interoperability. Proc. 4th IEEE International Conference on e-Science

  16. Lambrix P, Strömbäck L, (2007) 'Where is my protein? - Issues in Information Integration', BIOforum Europe, 7-8/07:24-26. Invited contribution. journal page.

  17. Lambrix, P, Tan, H, Jakoniene, V, Strömbäck, L (2007) Biological ontologies. Semantic Web: Revolutionizing Knowledge Discovery in the Life Sciences, pp 85-99, Springer, 2007. ISBN-10: 0-387-48436-1, ISBN-13: 978-0-387-48436-5. publisher's book page.

  18. Strömbäck, L, Hall, D, Lambrix, P, (2007) A review of standards for data exchange within systems biology. Proteomics, 7(6):857-867, 2007. Invited contribution. journal page.

  19. Strömbäck, L. (2006) A classification for comparing standardized XML data. Proc. 17th International Workshop on Database and Expert Systems Applications (DEXA'06), pp 517-521, Krakow, Poland, September 2006.

  20. Sauro H M, Uhrmacher A M, Harel D, Hucka M, Kwiatkowska M, Mendes P, Shaffer C A, Strömbäck L, Tyson J J. (2006) Challenges for Modeling and Simulation Methods in Systems Biology. In: Track on Modeling and Simulation in Computational Biology at the Winter Simulation Conference 2006.

  21. Strömbäck, L. (2006) A method for alignment of standardised XML information within systems biology. In: Track on Modeling and Simulation in Computational Biology at the Winter Simulation Conference 2006.

  22. Strömbäck, L, Jakoniene V, Tan, H, Lambrix P. (2006) Representing, storing and accessing molecular interaction data: a review of models and tools. Briefings in Bioinformatics, 7(4):331-338, 2006. Invited contribution. journal page.

  23. Strömbäck, L, Hall, D. (2006) An evaluation of the Use of XML for Representation, Querying, and Analysis of Molecular Interactions. In T. Grust et al. (Editors) Current Trends in Database Technology - EDBT 2006 Workshops. LNCS 4254, Springer Verlag.

The following is a list of publications prior to the project that are of high relevance for the topics addressed in the project:

  1. Strömbäck L, Lambrix P. (2005) 'Representations of molecular pathways: An evaluation of SBML, PSI MI and BioPAX', Bioinformatics, 21(24):4401-4407, 2005. pdf, journal page.

  2. Strömbäck, L. (2005) Possibilities and Challenges Using XML Technology for Storage and Integration of Molecular Interactions. 16th International Workshop on Database and Expert Systems Applications (DEXA'05), pp 575-579, Copenhagen, Denmark, August 2005.

  3. Strömbäck L, Lambrix P. (2005) 'Modeling for simulation and data storage of cellular pathways: Similarities and differences', Proceedings of the Fourth Modeling and Simulation in Biology, Medicine and Biomedical Engineering Conference - BioMedSim, pp 65-72, May 26-27, Linköping, Sweden, 2005.

  4. Strömbäck, L. (2004) XML representations of pathway data: a comparison. In Proceedings of the ACM SIGIR'04 Workshop on Search and Discovery in Bioinformatics. Sheffield, UK.

  5. Turcan, E., Strömbäck, L., Morris J. (2003) Share it! by bringing P2P into the TV-domain. In Proceedings of the Third IEEE International Conference on Peer-to-Peer Computing. Linköping, Sweden.


Page responsible: Lena Strömbäck
Last updated: 2011-01-14