CENIIT Project 06.05:
Efficient analysis and management of XML data
Background and industrial motivation
The Extensible Markup Language (XML) has become increasingly popular for representing information during the past few years. The main reason for its popularity is its flexibility. XML can be used in applications ranging from providing limited structure to documents to providing a database-like structure for larger datasets. It can be used for data storage as well as data exchange. Another reason for the popularity of XML is its simple structure, which makes it accessible and understandable even for people with a limited computer science background. A problem, however, is that when XML is used in real-life applications, the specific XML structure is often underspecified or unknown, as described by the following scenarios.
A cell biologist is studying how proteins react with each
other within living organisms. He has a computer tool for storing and
exploring information about his previous experiments. However, as
experimenting is expensive, he wants to plan future
experiments carefully by comparing his results with results from other research
teams. Such results are available on the web as XML files, and he
wants some easy way to compare them with his own results. The problem is
that even though XML standards are available in the area, many
competing formats exist, and the researcher would need to implement
time-consuming translations between these formats to
analyse the data and compare it with his own.
One important feature when designing set-top boxes for digital
television is to give the user a means of exploring information about
upcoming TV programs. The standard TV-Anytime (www.tv-anytime.org)
proposes to broadcast information about the time schedule and content
of upcoming TV shows as XML data. However, the broadcasters have only
agreed on a standardized XML format for a limited number of
features. For many important features they are allowed to define the
exact XML format themselves. This means that the end user, who cannot
be required to know anything about the XML-based representation of
information, will only be able to search efficiently on a small
part of the features actually sent to him, as he is dependent on the
variants of the standards actually implemented in his set-top box.
The aim of this project is to develop technology for analysis, management and storage of XML data that, by itself or guided by the user, takes variations between formats into account. This question is highly relevant to current database research, and results on XML management offer an opportunity for large impact in the research community.
Long term vision
The goal of this project is to develop technology that gives a user efficient access to and management of XML data across several competing formats or standards, where the end user has clear knowledge of the semantics of the structure he is interested in but less knowledge of the structure of the stored data. The approach is to use the user's knowledge of the domain to guide the analysis of data and to provide him with tools for further processing and storage of the dataset. To reach this goal we have to address the following three research questions:
How can we provide the user with a quick
overview of the content of an XML file without requiring knowledge
about the exact structure of the file?
Our previous studies of variations in XML data across various standards (Strömbäck 2004, Strömbäck & Lambrix 2005a, b) have shown that understanding the content of a dataset is crucial for processing it efficiently.
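As a rough illustration of what a structure-independent overview could look like, the sketch below counts how often each root-to-element tag path occurs in a document. It is only an assumed, minimal example and not the method developed in the project; the function name `summarize_paths` and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_paths(xml_text):
    """Count how often each root-to-element tag path occurs in the document."""
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(element, prefix):
        # Build the path of tag names from the root down to this element.
        path = prefix + "/" + element.tag
        counts[path] += 1
        for child in element:
            walk(child, path)

    walk(root, "")
    return counts


# Hypothetical sample data, loosely inspired by the protein interaction scenario.
SAMPLE = """<interactions>
  <interaction><protein>P1</protein><protein>P2</protein></interaction>
  <interaction><protein>P3</protein><protein>P4</protein></interaction>
</interactions>"""

if __name__ == "__main__":
    for path, n in sorted(summarize_paths(SAMPLE).items()):
        print(n, path)
```

A summary of this kind tells the user which elements a file contains and how they nest, without the user having to read a schema first.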
How can we provide the user with more in-depth
information about some particular part of an XML file without
requiring the user to know the exact structure of the file?
XML data can today be queried with the query language XQuery (www.w3c.org). Our previous evaluation (Strömbäck 2005) has shown that XQuery is a very powerful query language. It does, however, require that the user is familiar with the XML structure of a particular dataset, which is not the case in the applications of this project.
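The dependence on structure can be illustrated with a small sketch. It uses Python's standard `xml.etree.ElementTree` module as a stand-in for XQuery, and the two formats and both function names are invented for the example. A query written against the path of one format returns nothing on a competing format, while a search that ignores nesting finds the data in both.

```python
import xml.etree.ElementTree as ET

# Two hypothetical formats carrying the same information with different nesting.
FORMAT_A = "<schedule><program><title>News</title></program></schedule>"
FORMAT_B = (
    "<guide><channel><show><info><title>News</title></info></show></channel></guide>"
)


def titles_by_fixed_path(xml_text):
    """Query that assumes the exact structure of FORMAT_A."""
    root = ET.fromstring(xml_text)
    return [e.text for e in root.findall("./program/title")]


def titles_anywhere(xml_text):
    """Query that looks for <title> elements at any depth."""
    root = ET.fromstring(xml_text)
    return [e.text for e in root.iter("title")]


if __name__ == "__main__":
    for name, doc in (("A", FORMAT_A), ("B", FORMAT_B)):
        print(name, titles_by_fixed_path(doc), titles_anywhere(doc))
```

Matching on tag names alone is of course only a crude approximation; it still assumes that both formats use the same element name for the same concept.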
How can XML files be efficiently stored and
Efficient storage and indexing are an important research topic today, and several evaluations exist (e.g. Strömbäck 2005). Here we will study how the results of the semantic analysis from research question 1 can be used for automatic translation to databases and for indexing of stored datasets.
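One common way to translate XML to a relational database is to shred the document into a generic node table. The sketch below is a minimal example under that assumption, using SQLite from the Python standard library; the table layout, the index, and the function name `shred` are hypothetical and do not describe the storage design studied in the project.

```python
import sqlite3
import xml.etree.ElementTree as ET


def shred(xml_text, conn):
    """Store every element as one row: id, parent id, tag name, and text."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node ("
        "id INTEGER PRIMARY KEY, parent INTEGER, tag TEXT, text TEXT)"
    )
    # An index on the tag name supports lookups that ignore the nesting.
    conn.execute("CREATE INDEX IF NOT EXISTS node_tag ON node(tag)")

    def insert(element, parent_id):
        text = (element.text or "").strip() or None
        cur = conn.execute(
            "INSERT INTO node (parent, tag, text) VALUES (?, ?, ?)",
            (parent_id, element.tag, text),
        )
        node_id = cur.lastrowid
        for child in element:
            insert(child, node_id)

    insert(ET.fromstring(xml_text), None)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    shred("<a><b>x</b><b>y</b></a>", conn)
    for row in conn.execute("SELECT id, parent, tag, text FROM node ORDER BY id"):
        print(row)
```

A generic table like this accepts any XML format. A design that knows more about the meaning of the data could instead map frequently queried elements to dedicated tables and columns.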
Research environment
The project will be situated at the Department of Computer and Information Science (IDA), in the laboratory for Intelligent Information Systems (IISLAB). IISLAB is headed by Professor Nahid Shahmehri and has expertise within the following areas: database technology, database design for Internet and mobile applications, information security and integrity, peer-to-peer networks, and technology for the elderly. In this environment the project will be strongly connected to the team responsible for graduate and undergraduate education within database technology. There are also connections to the application areas bioinformatics, through Patrick Lambrix's work, and digital television, through the work by Lena Strömbäck in the European project Share it!
Publications
Publications resulting from the project:
- Strömbäck L, Ivanova V, Hall D (2011) 'Exploring Statistical Information for Applications-Specific Design and Evaluation of Hybrid XML storage'. Databases, Knowledge, and Data Applications DBKDA 2011, published by IEEE, January 2011, St Maarten, The Netherlands.
- Strömbäck L, Freire J (2011) 'XML Management for Bioinformatics Applications'. Computing in Science and Engineering (CiSE). To appear.
- Åsberg M, Strömbäck L (2011) 'Bioinformatics: From Disparate Web Services to Semantics and Interoperability.' International Journal of Advances in Software. To appear. Invited contribution.
- Laux F, Strömbäck L, (Eds) (2010) Proceedings of the Second International Conference on Advances in Databases, Knowledge, and Data Applications DBKDA 2010, published by IEEE, April 2010, Les Menuires, France.
- Åsberg M, Strömbäck L (2010) 'Interoperable and Easy-to-Use Web Services for the Bioinformatics Community - A Case Study.' The Second International Conference on Advances in Databases, Knowledge, and Data Applications, DBKDA 2010, April 2010, Les Menuires, France. Best Paper Award.
- Hall D, Strömbäck L (2010) 'Generation of Synthetic XML for Evaluation of Hybrid XML Systems.' In: Yoshikawa M, Meng X, Yumoto T, et al. (eds) Database Systems for Advanced Applications 15th International Conference, DASFAA 2010, International Workshops: GDM, BenchmarX, MCIS, SNSMW, DIEW, UDM, Tsukuba, Japan, April 1-4, 2010, Revised Selected Papers. Lecture Notes in Computer Science, Volume 6193, 2010, DOI: 10.1007/978-3-642-14589-6.
- Lambrix P, Strömbäck L, Tan H (2009) 'Information integration in bioinformatics with ontologies and standards', chapter 8 in Bry, Maluszynski, (eds), Semantic Techniques for the Web: The REWERSE perspective, Springer.
- Strömbäck, L, Schmidt S (2009) An Extension of XQuery for Graph Analysis of Biological, Proc. The First International Conference on Advances in Databases, Knowledge, and Data Applications, DBKDA 2009, March 1-6, 2009, Cancun, Mexico
- Ellkvist, T, Strömbäck, L, Didier Lins, L, and Freire, J (2009) A First Study on Strategies for Generating Workflow Snippets. The First International Workshop on Keyword Search on Structured Data, KEYS 2009, collocated with ACM SIGMOD/PODS, June 28th, 2009, Providence, Rhode Island, USA
- Ellkvist, T, Koop, D, Freire, J, Silva, C T, and Strömbäck, L (2009) Using Mediation to Achieve Provenance Interoperability, In IEEE International Workshop on Scientific Workflows, 2009.
- Strömbäck, L, Åsberg, M, Hall, D (2009) HShreX: a Tool for Design and Evaluation of Hybrid XML Storage, FlexDBIST 2009, Linz, Austria.
- Köhn D, Strömbäck L (2008) 'A method for Semi-automatic Standard Integration in Systems Biology',
Proc. 19th Intern. Conf. on Database and Expert Systems Applications, pp 745 - 752
- Lambrix P, Strömbäck L (2008)
'Where is my protein? - Issues in Information Integration',
BIOforum Europe, volume 12, pp 24 - 25. Republished as one of the highlights of 2007.
- Strömbäck, L Eifrém E, Faraglia P (2008)
'Customizable XML Management on a Navigational Database Framework',
Proc. 19th International Workshop on Database and Expert Systems Applications, Turin, Italy pp 261 - 265
- Ellkvist T, Freire J, Koop D, Silva C, Strömbäck L (2008) Using Mediation to Achieve Provenance Interoperability,
Proc. 4th IEEE International Conference on e-Science
- Lambrix P, Strömbäck L, (2007)
'Where is my protein? - Issues in Information Integration',
BIOforum Europe, 7-8/07:24-26. Invited contribution.
- Lambrix, P, Tan, H, Jakoniene, V, Strömbäck, L (2007) Biological ontologies. Semantic Web: Revolutionizing Knowledge Discovery in the Life Sciences, pp 85-99,
Springer, 2007. ISBN-10: 0-387-48436-1,
ISBN-13: 978-0-387-48436-5.
- Strömbäck, L, Hall, D, Lambrix, P, (2007) A review of standards for data exchange within systems biology. Proteomics, 7(6):857-867, 2007. Invited contribution.
- Strömbäck, L. (2006) A classification for comparing standardized XML data. Proc. 17th International Workshop on Database and Expert Systems Applications (DEXA'06), pp 517-521, Krakow, Poland, September 2006.
- Sauro H M, Uhrmacher A M, Harel D, Hucka M, Kwiatkowska M, Mendes P, Shaffer C A, Strömbäck L, Tyson J J. (2006) Challenges for Modeling and Simulation Methods in Systems Biology. In: Track on Modeling and Simulation in Computational Biology at the Winter Simulation Conference 2006.
- Strömbäck, L. (2006) A method for alignment of standardised XML information within systems biology. In: Track on Modeling and Simulation in Computational Biology at the Winter Simulation Conference 2006.
- Strömbäck, L, Jakoniene V, Tan, H, Lambrix P. (2006) Representing, storing and accessing molecular interaction data: a review of models and tools. Briefings in Bioinformatics, 7(4):331-338, 2006.
- Strömbäck, L, Hall, D. (2006) An evaluation of the Use of XML for Representation, Querying, and Analysis of Molecular Interactions. In T. Grust et al. (Editors) Current Trends in Database Technology - EDBT 2006 Workshops. LNCS 4254, Springer Verlag.
- Strömbäck L, Lambrix P. (2005)
'Representations of molecular pathways: An evaluation of SBML, PSI MI and BioPAX',
Bioinformatics, 21(24):4401-4407, 2005.
- Strömbäck, L. (2005) Possibilities and Challenges Using XML Technology for Storage and Integration of Molecular Interactions. 16th International Workshop on Database and Expert Systems Applications (DEXA'05), pp 575-579, Copenhagen, Denmark, August 2005.
- Strömbäck L, Lambrix P. (2005) 'Modeling for simulation and data storage of cellular pathways: Similarities and differences', Proceedings of the Fourth Modeling and Simulation in Biology, Medicine and Biomedical Engineering Conference - BioMedSim, pp 65-72, May 26-27, Linköping, Sweden, 2005.
- Strömbäck, L. (2004) XML representations of pathway data: a comparison. In Proceedings of the ACM SIGIR'04 Workshop on Search and Discovery in Bioinformatics. Sheffield, UK.
- Turcan, E., Strömbäck, L., Morris J. (2003) Share it! by bringing P2P into the TV-domain. In Proceedings of the Third IEEE International Conference on Peer-to-Peer Computing. Linköping, Sweden.
Page responsible: Lena Strömbäck
Last updated: 2011-01-14