
Integration and Interoperability of Graph-Data Systems

Research project funded by the Centrum för Industriell Informationsteknologi (CENIIT), project no. 17.05

Project leader: Olaf Hartig

Abstract: This 6-year project aims to investigate approaches (i) to integrate graph data across different systems that manage and process such data, and (ii) to integrate such systems as members of a federated system; this federated system will be able to perform workloads of queries and analysis algorithms transparently on the data that is distributed over the federation members. Moreover, in the context of this project, the applicant will establish a research group that specializes in topics related to federated graph data management.

Project Background and Industrial Relevance

Recent years have seen increasing industry interest in technologies for processing and managing graph data, where the term graph data refers to data that describes entities and the relationships between them in the form of a graph. Typical examples include data about transportation networks, protein interactions, biological food chains, social networks, bibliographic networks, topic maps, and knowledge bases.

The increasing interest in such data and in the technologies to manage it is evidenced by statistics indicating that "Graph DBMS increased their popularity by 500% within [...] 2 years [from 2013]," taking into account the number of mentions on websites, in professional profiles, and in job offers, as well as the frequency of technical discussions in online forums. Similarly, Forrester predicted in 2014 that by 2017 over 25% of enterprises would be using graph databases, and a year later a Forrester market analysis concluded that "the graph database market is embryonic but will grow significantly." Corresponding to this trend is the emergence of a plethora of open-source and commercially backed software systems that focus on graph data; new companies are appearing that center their business around graph data, and established database vendors such as Oracle and IBM are extending their product lines to provide graph data management solutions.

The specific challenge of processing and managing graph data lies in the high structural heterogeneity that is inherent in many datasets. More specifically, many graph datasets have skewed degree distributions. Due to such skew and the inherent dependencies within graph data, computational workloads on graphs may easily become unbalanced, and it is notoriously difficult to scale and parallelize them. Additionally, the diversity of emerging types of graph queries and graph analytics tasks calls for a broad variety of data management and data processing techniques. As a consequence, systems for processing or managing graph data (hereafter simply called graph-data systems) are designed for different types of use cases, and we can roughly divide these systems into the following three classes:

  • graph database systems that focus on graph-specific queries (e.g., neighborhood queries, navigational queries, traversals), with exemplars such as Neo4j, OrientDB, and Sparksee;
  • graph analytics systems that focus on complex graph analysis algorithms (e.g., community detection, influence propagation, centrality analysis), with exemplars such as Apache Giraph, PowerGraph, and GraphX;
  • triple stores that focus on pattern matching queries and semantic reasoning based on a family of standards related to the Resource Description Framework (RDF) [6] and its query language SPARQL [12], with exemplars such as AllegroGraph, Stardog, and Blazegraph.
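
To make the three classes more concrete, the following minimal sketch runs one representative task from each class over the same toy graph. It uses plain Python rather than any of the systems named above; the toy triples, the function names, and the choice of degree centrality as the analytics task are illustrative assumptions and not part of the project.

```python
from collections import defaultdict

# Hypothetical toy data: (subject, predicate, object) triples.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("alice", "knows", "carol"),
    ("bob", "knows", "carol"),
    ("carol", "knows", "dave"),
    ("alice", "worksAt", "acme"),
    ("dave", "worksAt", "acme"),
]


def build_adjacency(triples, predicate):
    """Undirected adjacency sets for all edges with the given predicate."""
    adj = defaultdict(set)
    for s, p, o in triples:
        if p == predicate:
            adj[s].add(o)
            adj[o].add(s)
    return adj


def neighborhood(adj, start, hops):
    """Graph-database style query: nodes within `hops` steps of `start`."""
    seen = {start}
    frontier = {start}
    for _ in range(hops):
        frontier = {n for v in frontier for n in adj[v]} - seen
        seen |= frontier
    return seen - {start}


def degree_centrality(adj):
    """Graph-analytics style task: normalized degree of every node."""
    n = len(adj)
    if n < 2:
        return {v: 0.0 for v in adj}
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}


def match(triples, pattern):
    """Triple-store style query: match a single triple pattern.

    Pattern terms starting with '?' are variables; returns variable bindings.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if binding.setdefault(term, value) != value:
                    break  # same variable bound to two different values
            elif term != value:
                break  # constant term does not match
        else:
            results.append(binding)
    return results


knows = build_adjacency(TRIPLES, "knows")
one_hop = neighborhood(knows, "alice", 1)
two_hop = neighborhood(knows, "alice", 2)
centrality = degree_centrality(knows)
acme_staff = match(TRIPLES, ("?who", "worksAt", "acme"))
```

Even on this toy scale the access patterns differ: the neighborhood query touches only nodes near its start node, the centrality computation visits every node, and the pattern match scans triples by predicate and object. This difference is one way to see why the three classes of systems tend to be engineered differently.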

Because graph-data systems are inherently specialized and diverse, it is likely that only a combination of multiple systems can sufficiently address all the graph query and graph analytics use cases within an organization or enterprise. For instance, consider applications for crime investigation and prevention (as worked on by IDA researchers in the EC-funded VALCRI project). Such applications may operate over large-scale graph-based datasets that describe the relationships between persons (e.g., victims, suspects, witnesses), locations (e.g., crime scenes), events (e.g., crimes), objects (e.g., as found at crime scenes), observations, intelligence reports, etc. An application that is implemented based on pattern matching queries over such graph data may enable users to identify connected crimes (e.g., different burglaries that all happened in the same area within a short period of time), whereas an implementation of complex clustering algorithms over such data may be used to reveal the connections between different members of criminal organizations. Observe that, while both of these example applications may be implemented by accessing the same graph-based dataset, the inherently different nature of the required graph processing tasks makes it very unlikely that these applications can be supported efficiently by the same underlying graph-data system. Given the diverse ecosystem of graph-data systems and the diversity of graph data applications, we foresee the emergence of many scenarios in which different (or even the same) collections of graph data within an organization or an enterprise are managed in multiple separate graph-data systems. Certainly, the same holds across organizations.
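
The contrast between the two example applications can be sketched in a few lines of plain Python. Everything here is hypothetical: the records, the field names, and the seven-day window are invented for illustration, and connected components computed with union-find stand in for the "complex clustering algorithms" mentioned above, which would be considerably more involved in practice.

```python
from datetime import date
from itertools import combinations

# Hypothetical records; all identifiers, areas, and dates are invented.
CRIMES = [
    {"id": "c1", "type": "burglary", "area": "north", "day": date(2017, 3, 1)},
    {"id": "c2", "type": "burglary", "area": "north", "day": date(2017, 3, 4)},
    {"id": "c3", "type": "burglary", "area": "south", "day": date(2017, 3, 2)},
    {"id": "c4", "type": "burglary", "area": "north", "day": date(2017, 6, 9)},
]
# Hypothetical person-to-person links, e.g. derived from reports.
LINKS = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]


def connected_burglaries(crimes, max_days):
    """Pattern-matching task: burglary pairs in one area within max_days."""
    burglaries = [c for c in crimes if c["type"] == "burglary"]
    return [
        (a["id"], b["id"])
        for a, b in combinations(burglaries, 2)
        if a["area"] == b["area"]
        and abs((a["day"] - b["day"]).days) <= max_days
    ]


def person_groups(links):
    """Analytics task (simplified): connected components via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    components = {}
    for node in list(parent):
        components.setdefault(find(node), set()).add(node)
    return sorted(components.values(), key=sorted)


pairs = connected_burglaries(CRIMES, max_days=7)
clusters = person_groups(LINKS)
```

The first function is a selective lookup that inspects only records matching a local pattern, while the second must consider every link in the dataset before it can report any group. Under these assumptions, the sketch mirrors the point made above that the two tasks place different demands on the underlying system.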

Based on these observations, we identify the following two general goals for the proposed project:

  • G1: develop and investigate approaches to integrate graph data across different graph-data systems;
  • G2: develop and investigate approaches to integrate graph-data systems as members of a federated system that can perform workloads of queries and analysis algorithms transparently on the data distributed over the federation members.

In addition to national and international enterprises that employ, or aim to employ, graph-data systems, the results of this research are highly relevant to the following types of industrial stakeholders:

  • Vendors of graph-data systems (such as Oracle, and the US-based company Blazegraph, LLC) may use the results to improve and extend their systems; e.g., by adding new data integration features.
  • Companies that provide data integration products or services for relational databases (such as the US-based company Capsenta, Inc.) may extend their portfolio by introducing support for graph-data systems.
  • Service providers for data-centric infrastructure projects (such as MetaSolutions AB, Sweden, and LocaliData, Spain) may use the results to extend their expertise and employ graph data technologies in future projects.

Long Term Vision of the Project

The long-term vision of the project is to put Linköping University on the map for excellent research in different areas of data management. To this end, the applicant has the ambition to establish a strong research group that starts out by specializing in topics related to federated graph data management. The specific niche that the applicant aims to focus on in this area is approaches that support the integration of highly heterogeneous sets of graph-data systems that are based on multiple different abstract data models, a topic that is not yet much on the radar of other researchers in the area (judging from existing publications in the literature).

To achieve international recognition, the new group will publish its research results in top conferences and journals of the field; additionally, the group will publish high-quality software source code and documentation of the research prototypes that it builds. Another way to increase the visibility of the group will be to deliver research tutorials at top conferences and to organize research workshops related to the topics of the group.

Research Environment and Industrial Cooperation

The project will be conducted within the ADIT division of IDA. The project leader has extensive expertise in both foundational and systems research on queries over federations of RDF-based graph data. Moreover, through contract work for the triple store vendor Blazegraph, LLC, the project leader also has experience in building complex graph data management solutions that are used in production environments. Hence, with his expertise and experience, the project leader considers himself to be in an excellent position not only to conduct the project and achieve the aforementioned outcomes, but also to enable students to excel in their contributions to the project.

The project leader plans to use his industry contacts to disseminate the results of the project among enterprises and to identify opportunities for applying these results in an industrial context. The following industrial contacts have already expressed their interest in the project: Oracle, Blazegraph, LLC, Capsenta, Inc., LocaliData, and MetaSolutions AB.

In addition to industrial connections, the project leader has a wide network of academic collaborators that he will leverage to achieve the goals of the project. These collaborators include researchers from the Universities of Waterloo and of Toronto (both Canada); the Universidad de Chile, the Pontificia Universidad Catolica de Chile, and the Universidad Tecnica Federico Santa Maria (all Chile); Ghent University (Belgium); the University of Oxford (UK); and the Institute for High Performance Computing and Networking (ICAR-CNR, Italy).

Status and List of Publications

The project started in 2017.

Publications of the project leader that will be direct input to the project:

Initial publications resulting from the project:

Page responsible: Olaf Hartig
Last updated: 2017-08-31