Software is a major source of reliability degradation in dependable systems.
One of the classical remedies is to provide software fault tolerance by using N-Version
Programming (NVP). However, because NVP solutions require special hardware
and changes and additions at all levels of the system, they
are costly and have been used only in special cases.
In previous work, a low-cost architecture for NVP execution was developed.
The key features of this architecture are the use of off-the-shelf components
and the relocation of the fault-tolerance functionality (voting, error detection,
fault masking, consistency management, and recovery) into separate
redundancy management circuitry, with one unit for each redundant computing node.
In this article we present an improved design of that architecture, specifically
resolving some potential inconsistencies that were not treated in detail in the original
design. In particular, we present novel techniques for enforcing replica determinism
and a method for reintegration of the redundancy management circuitry
after a transient failure.
Our improved architecture is based on the Controller Area Network
(CAN). This has several benefits, including low cost and the data consistency provided
by CAN, which allows us to simplify the mechanisms for replica determinism and reintegration.
Although initially developed for NVP, our redundancy management circuitry
also supports other software replication techniques, such as active replication.
2007.
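
To make the voting and fault-masking role of the redundancy management circuitry more concrete, the following is a minimal software sketch of majority voting over the outputs of N redundant versions. It is an illustration only and not the design presented in the article, which places this functionality in dedicated circuitry; the function name majority_vote and its return convention are assumptions introduced here.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def majority_vote(outputs: Sequence[Hashable]) -> Tuple[Optional[Hashable], List[int]]:
    """Vote over the outputs of N redundant versions (hypothetical helper).

    Returns (voted_value, suspect_indices). voted_value is None when no
    strict majority exists, in which case every version is reported as suspect.
    """
    if not outputs:
        raise ValueError("at least one version output is required")

    # Count identical outputs and pick the most frequent one.
    value, hits = Counter(outputs).most_common(1)[0]

    # Fault masking requires a strict majority of agreeing versions.
    if hits * 2 <= len(outputs):
        return None, list(range(len(outputs)))

    # Error detection: versions that disagree with the majority are flagged.
    suspects = [i for i, out in enumerate(outputs) if out != value]
    return value, suspects


if __name__ == "__main__":
    # Three versions, one of which delivers a deviating result.
    print(majority_vote([42, 42, 17]))
```

With three versions, a single deviating output is masked and its index is reported, while three mutually different outputs yield no voted value. The sketch assumes exact-match comparison, which is only meaningful when the replicas behave deterministically.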