Degree Projects

Projects may be available in the following areas:

  • Ontology Engineering
  • Databases
  • Machine Learning
  • Text mining

Contact: Patrick Lambrix, Jose M. Peña, Fang Wei-Kleiner.

Evaluation of Hadoop as a persistence mechanism for Archetype-Based Electronic Health Record Systems

Contact persons: Fang Wei-Kleiner, Patrick Lambrix

(Joint project with Sergio Freire, Daniel Karlsson, and Erik Sundvall at the Department of Biomedical Engineering)


An electronic health record (EHR) is a computer-processable repository of information about the health status of a patient. The openEHR Foundation has developed a set of specifications, known as the multilevel modelling approach, designed for building future-proof systems. This approach combines a stable reference model (RM), which can be implemented in software, with a flexible domain model expressed in archetypes and templates. The RM is the model whose classes are persisted, and it is intended to be stable, i.e., its classes should not change frequently. The archetypes give semantic meaning to the objects that are persisted via the reference model. The openEHR proposal is that structural changes and business rules are reflected in the archetypes rather than in the RM, so that no changes are needed in the persistence mechanism, be it relational, object-oriented, XML, etc.
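The separation between the two levels can be illustrated with a minimal Python sketch. It is not the openEHR reference model: the classes `Element`, `Entry` and `Archetype`, the archetype identifier and the node ids are all hypothetical and chosen only for illustration. The point it tries to show is that the persisted classes stay generic, while the clinical meaning and the constraints live in a separate archetype definition.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Element:
    """Generic reference-model node (hypothetical, heavily simplified)."""
    archetype_node_id: str  # links the node to its meaning in an archetype
    value: Any


@dataclass
class Entry:
    """Generic reference-model container for one clinical statement."""
    archetype_id: str
    items: List[Element] = field(default_factory=list)


@dataclass
class Archetype:
    """Domain-level definition: names and constraints for generic nodes."""
    archetype_id: str
    # node id -> (human-readable name, validity check)
    constraints: Dict[str, Tuple[str, Callable[[Any], bool]]]

    def validate(self, entry: Entry) -> List[str]:
        """Return a list of constraint violations (empty if the entry is valid)."""
        if entry.archetype_id != self.archetype_id:
            return [f"entry is not an instance of {self.archetype_id}"]
        errors = []
        for item in entry.items:
            rule = self.constraints.get(item.archetype_node_id)
            if rule is None:
                errors.append(f"unknown node {item.archetype_node_id}")
                continue
            name, check = rule
            if not check(item.value):
                errors.append(f"{name}: value {item.value!r} out of range")
        return errors


# Hypothetical archetype; the identifier and node ids are illustrative only.
BP_ID = "example-OBSERVATION.blood_pressure.v1"
blood_pressure = Archetype(
    BP_ID,
    {
        "at0004": ("systolic", lambda v: 0 <= v < 1000),
        "at0005": ("diastolic", lambda v: 0 <= v < 1000),
    },
)

reading = Entry(BP_ID, [Element("at0004", 120), Element("at0005", 80)])
bad_reading = Entry(BP_ID, [Element("at0004", -5)])

print(blood_pressure.validate(reading))
print(blood_pressure.validate(bad_reading))
```

In this sketch, supporting a new clinical concept means adding another `Archetype` instance; `Element` and `Entry`, and therefore whatever stores them, remain unchanged.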

An important decision when developing systems based on the multilevel modelling approach is the choice of persistence mechanism, so that performance and query requirements are met. To be used in production, such systems must perform well not only when querying for data about an individual (clinical query) but also when querying for data about a group of individuals (epidemiological query). Since the RM has a large set of classes that can form relatively deep hierarchies, a pure object-relational mapping may not be an efficient solution, as suggested by the literature and by discussions in the openEHR community. Some openEHR-based open-source implementations have recently been made public, but their performance on realistic epidemiological data and queries has not been described. XML databases do not perform well for epidemiological queries when EHR data, generated according to the RM, is serialized to XML and stored in those databases.
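The difference between the two query types can be made concrete with a small Python sketch. The data layout (records grouped per patient, each item a path and a value), the paths and the values are all assumptions made for illustration; they do not come from any openEHR implementation. With records organised per patient, a clinical query is a single lookup, whereas an epidemiological query has to visit every patient's record.

```python
from collections import defaultdict

# Hypothetical flattened EHR content: (ehr_id, archetype path, value).
rows = [
    ("ehr-1", "blood_pressure/systolic", 120),
    ("ehr-1", "blood_pressure/diastolic", 80),
    ("ehr-2", "blood_pressure/systolic", 150),
    ("ehr-3", "blood_pressure/systolic", 165),
    ("ehr-3", "body_weight/weight", 82),
]

# Patient-centred organisation: everything about one patient is kept together.
by_patient = defaultdict(list)
for ehr_id, path, value in rows:
    by_patient[ehr_id].append((path, value))


def clinical_query(ehr_id):
    """All data about one individual: one lookup by patient id."""
    return by_patient.get(ehr_id, [])


def epidemiological_query(path, predicate):
    """Patients whose value at `path` satisfies `predicate`.

    With a patient-centred layout this scans every record in the store.
    """
    hits = set()
    for ehr_id, items in by_patient.items():
        for item_path, value in items:
            if item_path == path and predicate(value):
                hits.add(ehr_id)
    return sorted(hits)


print(clinical_query("ehr-1"))
print(epidemiological_query("blood_pressure/systolic", lambda v: v >= 140))
```

The cost of the second function grows with the total number of stored items, which is one way to see why a layout that suits clinical queries does not automatically suit epidemiological ones.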

Therefore, it is still an open issue how to implement a persistence mechanism that performs well for archetype-based systems for both clinical and epidemiological queries. Several NoSQL proposals have been developed over the years for storing unstructured or semi-structured data, such as: column stores, e.g., Hadoop/HBase (hadoop.apache.org) and Cassandra (cassandra.apache.org); document stores, e.g., Terrastore (code.google.com/p/terrastore); key-value or tuple stores, e.g., Amazon SimpleDB (aws.amazon.com/simpledb); graph databases, e.g., Neo4j (neo4j.org) and InfoGRID (infogrid.org); and RDF triple stores, e.g., AllegroGraph (www.franz.com/agraph/allegrograph).

Project Goal

The overall goal of the project is the implementation and evaluation of Hadoop as a persistence mechanism for openEHR-based electronic health record systems.

  1. Background study to gain a thorough understanding of openEHR standards and archetypes.
  2. Design of the index structure of the EHR databody and implementation of the query engine over the Hadoop framework.
  3. Evaluation and testing.
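As a rough idea of what step 2 might involve, the following Python sketch builds a path-based inverted index in a map/shuffle/reduce style and answers a threshold query from it. It is only one possible design, not the structure the project prescribes, and it does not use the Hadoop API: the map, shuffle and reduce phases are simulated in plain Python, and the record format, paths and function names are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict
from itertools import chain

# Hypothetical input: one record per patient, leaves keyed by archetype path.
records = [
    ("ehr-1", {"blood_pressure/systolic": 120, "blood_pressure/diastolic": 80}),
    ("ehr-2", {"blood_pressure/systolic": 150}),
    ("ehr-3", {"blood_pressure/systolic": 165, "body_weight/weight": 82}),
]


def map_record(record):
    """Map phase: emit (path, (value, ehr_id)) for every leaf of one record."""
    ehr_id, leaves = record
    for path, value in leaves.items():
        yield path, (value, ehr_id)


def shuffle(pairs):
    """Shuffle phase: group the emitted values by key, as a framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_postings(values):
    """Reduce phase: sort the postings of one path by value."""
    return sorted(values)


def build_index(all_records):
    """Run the three phases and return path -> sorted [(value, ehr_id), ...]."""
    pairs = chain.from_iterable(map_record(r) for r in all_records)
    return {path: reduce_postings(vals) for path, vals in shuffle(pairs).items()}


def patients_at_or_above(index, path, threshold):
    """Epidemiological query answered from the index by binary search."""
    postings = index.get(path, [])
    # (threshold,) sorts before any (threshold, ehr_id), so this finds the
    # first posting whose value is >= threshold.
    start = bisect_left(postings, (threshold,))
    return sorted({ehr_id for _, ehr_id in postings[start:]})


index = build_index(records)
print(patients_at_or_above(index, "blood_pressure/systolic", 140))
```

Keeping the postings of each path sorted by value lets a range query locate its starting point by binary search instead of scanning every record; whether such an index pays off on realistic data is the kind of question the evaluation step would have to answer.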

Prerequisites

  • Good knowledge of Java is mandatory.
  • Knowledge of XML, XML indexing (e.g., from having attended the Advanced Database course), and Hadoop is beneficial.

Date Added: October 9, 2012

Page responsible: Patrick Lambrix
Last updated: 2012-10-09