Fault tolerance in computing

Prof. Lorenzo Strigini
Centre for Software Reliability, City University London

16 hours, 4 credits

October 5 - October 8, 2010

Dipartimento di Ingegneria dell'Informazione: Elettronica, Informatica, Telecomunicazioni, Largo Lucio Lazzarino, meeting room

Contacts: Prof. Cinzia Bernardeschi



Fault tolerance, that is, the clever use of redundancy, is one of the organising principles for achieving dependability and resilience in all systems. Fault tolerance techniques are well established in some areas of computing, and many off-the-shelf building blocks routinely include some fault tolerance mechanism. Yet the philosophy of fault tolerance, and knowledge of its design patterns and tricks, are not widespread among those who could take advantage of them, especially in the design of applications and of complex hardware-software-human systems. Specific technical communities (e.g., in various safety-critical applications of embedded computers) have consolidated techniques and practices for redundant design, diversity and so on. However, attempts to improve these practices, or to apply the same principles outside these specialised communities, often lead to controversy (e.g., in the security community), which arises from the lack of a common language for dealing with the basic issues in fault tolerance.

These lectures aim to:

  • introduce the need for fault tolerance, the general principles that underlie it, examples of the techniques it uses and of typical design schemes, some of the trade-offs and decision-making problems involved in its use, and some of the open research problems;
  • enable students to apply these principles in simple concrete contexts: to detect situations in which applying fault tolerance is appropriate; to apply it in simple designs of hardware, software and socio-technical systems; and to spot typical potential defects in fault-tolerant systems and processes, as well as fallacies in their assessment.