Master Thesis - Past Projects - Abstract

Estimating Time to Repair Failures in a Distributed System

ID: LIU-IDA/LITH-EX-G--16/072--SE

To ensure the quality of important services, high availability is critical. One aspect of availability is system downtime, which can be measured as the time to recover from failures. In this report we investigate current research on repair time and the possibility of estimating this metric from relevant parameters such as hardware and the type of fault. We thoroughly analyze a data set containing 43 000 failure traces from Los Alamos National Laboratory, covering 22 different cluster-organized systems. To enable the analysis we create and use a program that parses the raw data, sorts and categorizes it based on certain criteria, and formats the output for visualization. We analyze this data set with respect to type of fault, memory size, processor quantity, and the times at which repairs were started and completed. We visualize the number of failures and the average repair times as functions of these parameters. For different faults and times of day we also display the empirical cumulative distribution function to give an overview of the probability of different repair times. The failures are caused by a variety of faults, of which hardware and software faults occur most frequently. These two, along with network faults, have the highest average downtime. Time of failure proves important, since both day of week and hour of day show patterns that can be explained by, for example, work schedules. The hardware characteristics of nodes seem to affect the repair time as well, although how this correlation works is difficult to conclude. Based on the extracted data we suggest two simple methods of formulating a mathematical model for estimating downtime, both of which prove insufficient; more research on the subject and on how the parameters affect each other is required.
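The kind of processing the abstract describes (grouping failure records by fault type, averaging repair times, and evaluating an empirical cumulative distribution function) can be illustrated with a minimal sketch. This is not the program from the thesis: the record layout, the timestamp format, the function names and the sample values below are all assumptions made up for illustration, and none of the numbers come from the Los Alamos data.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records: (fault category, repair start, repair end).
# These values are invented for illustration; they are NOT from the LANL data set.
RECORDS = [
    ("hardware", "2005-01-03 08:00", "2005-01-03 12:00"),
    ("hardware", "2005-01-04 09:30", "2005-01-04 10:30"),
    ("software", "2005-01-05 14:00", "2005-01-05 14:30"),
    ("network", "2005-01-06 22:00", "2005-01-07 04:00"),
]

# Assumed timestamp format; the real raw data may use a different one.
FMT = "%Y-%m-%d %H:%M"


def repair_hours(start, end):
    """Return the repair time in hours between two timestamp strings."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 3600.0


def group_by_fault(records):
    """Map each fault category to a sorted list of its repair times (hours)."""
    groups = defaultdict(list)
    for fault, start, end in records:
        groups[fault].append(repair_hours(start, end))
    return {fault: sorted(times) for fault, times in groups.items()}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def ecdf(times, t):
    """Empirical CDF: fraction of observed repairs completed within t hours."""
    return sum(1 for x in times if x <= t) / len(times)


if __name__ == "__main__":
    for fault, times in group_by_fault(RECORDS).items():
        print(fault, "mean repair time (h):", mean(times),
              "| share repaired within 2 h:", ecdf(times, 2.0))
```

The same grouping step could be keyed on memory size, processor count, day of week or hour of day instead of fault type; only the first element of each record would change.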

Keywords:

File: Click here to download/view the thesis

Author(s): Matilda Söderholm and Lisa Habbe

Contact: Mikael Asplund

Last modified February 2017.