Fault tolerance is the property that allows a system to continue operating when some of its components fail. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, whereas in a poorly designed system even a minor failure can lead to total breakdown. Digital computer systems are a prominent setting for fault tolerance mechanisms. Three characteristics of such systems determine how they fail and which fault tolerance mechanisms are appropriate. The first is that the systems are digital. The second is that digital systems encode information. The third is that digital systems can modify their behavior based on the information that they process (Poledna, 1996).
Fault tolerance is sometimes referred to as redundancy management, where redundancy means the provision of functional capabilities that would be unnecessary in a fault-free environment. A computer system may provide redundant functions in such a way that at least one of the results remains correct in the presence of a fault (Pelliccione, 2007).
Redundancy management comprises the following actions.
Fault detection
This is the process of determining that a fault has occurred.
Fault diagnosis
This is the step of identifying the cause of the fault or locating the faulty subsystem.
Fault containment
This is the stage of preventing a fault from propagating from its point of origin to the rest of the system, where it would degrade performance (Poledna, 1996).
Fault masking
This is the process of ensuring that only correct values are passed on to the system even if a component fails (Poledna, 1996).
Fault compensation
If a fault occurs and is confined to a subsystem, the system should provide a response that compensates for the output of the faulty subsystem (Poledna, 1996).
Fault repair
This process involves removing faults from the system. In a well-designed fault-tolerant system, faults are brought under control before they escalate to a level at which they can no longer be contained.
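The detection, containment and compensation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the component is modelled as a plain function, a fault is assumed to show up either as an exception or as an out-of-range value, and compensation is assumed to mean substituting the last known good value. All names (run_contained, detect_fault, read_with_compensation) are hypothetical and do not come from the cited sources.

```python
from typing import Callable, Optional, Tuple


def run_contained(component: Callable[[float], float], x: float) -> Optional[float]:
    """Fault containment: catch any exception raised inside the component so
    it cannot propagate to the caller. Returns None if the component failed."""
    try:
        return component(x)
    except Exception:
        return None


def detect_fault(result: Optional[float], low: float, high: float) -> bool:
    """Fault detection: report a fault if there is no result or if the
    result lies outside the plausible range [low, high]."""
    return result is None or not (low <= result <= high)


def read_with_compensation(
    component: Callable[[float], float],
    x: float,
    last_good: float,
    low: float = 0.0,
    high: float = 100.0,
) -> Tuple[float, bool]:
    """Fault compensation: substitute the last known good value when the
    component's output is judged faulty. Returns (value, fault_detected)."""
    result = run_contained(component, x)
    if detect_fault(result, low, high):
        return last_good, True
    return result, False


if __name__ == "__main__":
    def healthy_sensor(x: float) -> float:
        return x * 2.0

    def broken_sensor(x: float) -> float:
        # Hypothetical faulty component used only for the demonstration.
        raise RuntimeError("sensor offline")

    print(read_with_compensation(healthy_sensor, 10.0, last_good=18.0))  # (20.0, False)
    print(read_with_compensation(broken_sensor, 10.0, last_good=18.0))   # (18.0, True)
```

The sketch does not cover diagnosis or repair, which in a real system would depend on how the faulty component can be identified and replaced.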
Another fault tolerance mechanism is the acceptance test. Fault detection is the mechanism that triggers the remaining fault tolerance functions, and its two key techniques are comparison and acceptance tests. An acceptance test checks a computed result against a condition that any correct result must satisfy. Once a fault has been detected, it is managed through the processes already described: fault detection, fault diagnosis, fault containment, fault masking, fault compensation and fault repair (Banâtre, 1994).
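As a rough illustration of an acceptance test, the Python sketch below checks the output of a square-root routine without recomputing it by the same method: a result is accepted only if it is non-negative and its square is close to the input. The function names, the FaultDetected exception and the tolerances are assumptions made for this example, not part of the cited sources.

```python
import math
from typing import Callable


class FaultDetected(Exception):
    """Raised when a result fails its acceptance test."""


def sqrt_acceptance_test(x: float, result: float) -> bool:
    """Accept result as sqrt(x) only if it is non-negative and its square
    is close to x. The tolerances are illustrative choices."""
    return result >= 0.0 and math.isclose(result * result, x, rel_tol=1e-9, abs_tol=1e-12)


def checked_sqrt(x: float, compute: Callable[[float], float] = math.sqrt) -> float:
    """Run compute(x) and pass the result on only if it is accepted."""
    result = compute(x)
    if not sqrt_acceptance_test(x, result):
        raise FaultDetected(f"sqrt({x}) returned implausible value {result}")
    return result


if __name__ == "__main__":
    print(checked_sqrt(9.0))  # 3.0
    try:
        # Hypothetical faulty routine that halves its input instead.
        checked_sqrt(9.0, compute=lambda x: x / 2)
    except FaultDetected as err:
        print("fault detected:", err)
```

An acceptance test of this kind needs only one copy of the computation, but it can catch only those faults that violate the chosen condition.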
Another fault tolerance mechanism involves comparison techniques, which are the alternative to acceptance tests for detecting faults. For example, if the main source of faults is the computer's processor, then several processors should be used to run the same program (Pelliccione, 2007). As the results are calculated, they are compared across all processors. If the results do not match, a fault is present. The comparison may be done in pairs or involve more than two processors at the same time. Comparing the results of more than two processors at the same time is referred to as voting (Pelliccione, 2007).
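Pairwise comparison and voting can be sketched as follows. In this Python illustration each processor is modelled as a plain function, results are assumed to be exactly comparable integers, and voting is assumed to mean a strict majority. The names run_replicas, results_agree and majority_vote are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Optional


def run_replicas(replicas: List[Callable[[int], int]], x: int) -> List[int]:
    """Run the same computation on every replica and collect the results."""
    return [replica(x) for replica in replicas]


def results_agree(a: int, b: int) -> bool:
    """Pairwise comparison: a mismatch reveals a fault, but not which
    of the two results is the wrong one."""
    return a == b


def majority_vote(results: List[int]) -> Optional[int]:
    """Voting: return the value reported by a strict majority of the
    replicas, or None if no such majority exists."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None


if __name__ == "__main__":
    def square(x: int) -> int:
        return x * x

    def stuck_at_zero(x: int) -> int:
        # Hypothetical faulty processor that always outputs 0.
        return 0

    pair = run_replicas([square, stuck_at_zero], 7)
    print(results_agree(pair[0], pair[1]))  # False

    triple = run_replicas([square, square, stuck_at_zero], 7)
    print(majority_vote(triple))  # 49
```

With two replicas a mismatch can only be detected, whereas with three or more the majority value can also be used to mask the faulty result. Floating-point results would normally need a tolerance rather than exact equality.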
A further fault tolerance mechanism is diversity, the implementation of more than one variant of the function to be carried out. In computer-based applications it is generally accepted that varying a design at higher levels of abstraction is more valuable than varying the design details. Because the different designs must implement a common system specification, dependencies between them can arise whenever the specification is refined to reflect difficulties encountered during implementation (Banâtre, 1994).
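Diversity can be illustrated with two independently designed variants of the same function whose results are compared. The Python sketch below uses the integer square root as an assumed example function, with one variant based on Newton's iteration and one on binary search. The function names are hypothetical, and a real system would draw its variants from separate design efforts rather than from one file.

```python
from typing import Callable, Sequence


def isqrt_newton(n: int) -> int:
    """Variant 1: integer square root by Newton's iteration."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0
    x = n
    y = (x + 1) // 2
    while y < x:
        x = y
        y = (x + n // x) // 2
    return x


def isqrt_bisect(n: int) -> int:
    """Variant 2: integer square root by binary search."""
    if n < 0:
        raise ValueError("n must be non-negative")
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def diverse_isqrt(
    n: int,
    variants: Sequence[Callable[[int], int]] = (isqrt_newton, isqrt_bisect),
) -> int:
    """Run every variant and accept the result only if all of them agree."""
    results = [variant(n) for variant in variants]
    if len(set(results)) != 1:
        raise RuntimeError(f"variants disagree for n={n}: {results}")
    return results[0]


if __name__ == "__main__":
    print(diverse_isqrt(10))  # 3
```

Both variants still implement the same specification, so an error in the specification itself would affect them equally, which is the dependency that the paragraph above describes.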
References
Banâtre, M. (1994). Hardware and software architectures for fault tolerance: experiences and perspectives. Berlin: Springer-Verlag.
Pelliccione, P. (2007). Software engineering of fault tolerant systems. Singapore: World Scientific.
Poledna, S. (1996). Fault-tolerant real-time systems: the problem of replica determinism. Boston: Kluwer Academic Publishers.