What is the difference between software and hardware failures in a reliability context?

Reliability engineers have traditionally focused more on hardware than software. There are many reasons for this; one is that safety systems have traditionally been based on analog electronics, and although digital controls and PLCs were introduced throughout the 1990s, the software involved was initially very simple. Today the situation has changed considerably, but reliability practice has not completely taken this on board. One reason may be that reliability experts like to calculate probabilities – which they are very good at doing for hardware failures. Hardware failures tend to be random and can be modeled quite well using probabilistic tools. So – what about software? The failure mechanisms are very different: whereas hardware failures stem from more or less stochastic effects such as load cycling, material defects and ageing, software defects are completely deterministic (we disregard stochastic algorithms here – they are banned from use in safety-critical control systems anyway).
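To make the hardware side concrete: the simplest of these probabilistic tools is the constant failure rate (exponential) model. The sketch below is a minimal illustration in Python; the failure rate is an invented number used purely for the example.

    import math

    def probability_of_failure(failure_rate_per_hour, hours):
        # Exponential model: probability that a component with a constant
        # failure rate has failed by time t, F(t) = 1 - exp(-lambda * t).
        return 1.0 - math.exp(-failure_rate_per_hour * hours)

    # Illustrative: a sensor with an assumed dangerous failure rate of 2e-6 per hour,
    # evaluated over one year of operation (8760 hours).
    print(probability_of_failure(2e-6, 8760))  # roughly 0.017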

Software defects exist for two reasons: design errors (flaws) and implementation errors (bugs). These errors may be introduced at the requirements stage or during actual coding, but irrespective of when they are introduced, they are always static. They do not suddenly occur – they are latent errors hidden within the code that activate each and every time the software enters the state where the error is relevant.
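A tiny, hypothetical illustration of such a latent defect: the function below contains a flawed guard that does nothing until the offending state is actually reached – and then it fails deterministically, every single time.

    def scale_reading(raw_value, range_span):
        # Convert a raw sensor count to a percentage of the configured range.
        # Latent flaw: the guard should reject a zero span (range_span <= 0),
        # but only rejects negative spans. Nothing happens until a configuration
        # with range_span == 0 is loaded - then the function fails every time.
        if range_span < 0:
            raise ValueError("invalid range span")
        return 100.0 * raw_value / range_span  # ZeroDivisionError whenever span is 0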

Such errors are very difficult to include in a probabilistic model. That is why reliability standards prescribe a completely different medicine: a process-oriented framework that sets requirements for management, choice of methods and tools, as well as testing and documentation. These quality-directed workflows and requirements are put in place so that we can have some confidence that the software is not a significant source of unsafe failures of the critical control system.

Hence – process verification and auditing take the place of probability calculations when we look at the software. To achieve the desired level of trust, it is very important that these practices are not neglected in the functional safety work. Deterministic errors may be just as catastrophic as random ones – and therefore they must be managed with just as much rigor and care. The current trend is that more and more functionality is moved from hardware to software, which means that software errors are becoming increasingly important to manage correctly if we are not going to degrade both the performance of and the trust in the safety instrumented systems we rely on to protect our lives, assets and the environment.

What does a “SIL” requirement really mean?

Safety instrumented systems are often assigned a “Safety Integrity Level” (SIL). This is an important concept for ensuring that automatic controls intended to maintain the safety of a technical system actually bring the risk reduction that is necessary. In the reliability standards IEC 61508 and IEC 61511 there are four SILs:

  • SIL 1: at most one failure in 10 demands is acceptable
  • SIL 2: at most one failure in 100 demands is acceptable
  • SIL 3: at most one failure in 1 000 demands is acceptable
  • SIL 4: at most one failure in 10 000 demands is acceptable

This way of defining the probability of failure applies to so-called “low-demand” systems. In practice that means that the safety function is not expected to act more than once per year in order to stop an accident from occurring.
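As a rough illustration of how these bands are used, the sketch below applies the common simplified approximation PFD_avg ≈ λ_DU · TI / 2 for a single, non-redundant (1oo1) component and maps the result onto the low-demand SIL bands above. The failure rate and proof test interval are invented numbers, not values from any standard.

    def pfd_average_1oo1(dangerous_undetected_rate, proof_test_interval_hours):
        # Simplified average probability of failure on demand for a single
        # (1oo1) component: PFD_avg ~ lambda_DU * TI / 2.
        return dangerous_undetected_rate * proof_test_interval_hours / 2.0

    def sil_band_low_demand(pfd):
        # Map a PFD value onto the low-demand SIL bands listed above.
        if pfd < 1e-5:
            return "beyond SIL 4"
        if pfd < 1e-4:
            return "SIL 4"
        if pfd < 1e-3:
            return "SIL 3"
        if pfd < 1e-2:
            return "SIL 2"
        if pfd < 1e-1:
            return "SIL 1"
        return "below SIL 1"

    # Illustrative: lambda_DU = 5e-7 per hour, proof tested once a year (8760 hours).
    pfd = pfd_average_1oo1(5e-7, 8760)
    print(pfd, sil_band_low_demand(pfd))  # about 2.2e-3 -> SIL 2 band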

The SIL requirement does not only involve probability calculations (the probability of failure on demand, PFD). A SIL consists of four different types of requirements:

  • Quantitative requirements (PFD, defined as the probability of failure when there is a demand for the function)
  • Semi-quantitative requirements (requirements for redundancy and for a certain share of the possible failures of the system leading to a safe state – the so-called safe failure fraction; see the sketch after this list)
  • Software requirements (a lot of the actual control functionality is implemented in software; for this the standards require a work-process-oriented approach, whose rigor increases with increasing SIL)
  • Qualitative requirements (avoidance of systematic errors, quality management, etc.)
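As a small sketch of the semi-quantitative part, the safe failure fraction can be computed from the split of the overall failure rate into safe, dangerous-detected and dangerous-undetected contributions. The rates below are purely illustrative.

    def safe_failure_fraction(safe_rate, dangerous_detected_rate, dangerous_undetected_rate):
        # SFF = (lambda_safe + lambda_dangerous_detected) / lambda_total:
        # the share of all failures that either end in a safe state or are
        # detected by diagnostics before they can do harm.
        total = safe_rate + dangerous_detected_rate + dangerous_undetected_rate
        return (safe_rate + dangerous_detected_rate) / total

    # Illustrative rates (per hour): most failures are safe or detected.
    print(safe_failure_fraction(4e-6, 1e-6, 5e-7))  # about 0.91, i.e. 91 %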

Most people focus only on the quantitative part and do not think about the latter three. In order for us to have trust in the probability assessment, it is necessary that issues that cannot be quantified are properly managed. Hence – to claim that you have achieved a certain SIL for your safety function, you need to document that the redundancy is right, that most failures will lead to a safe state, that your software has been developed in accordance with required practices and using acceptable technologies, and that your organization and workflows ensure sufficient quality of your safety function product and the system it is a part of.

If people buying components for safety instrumented systems kept this in mind, it would become much easier to actually create safety-critical automation systems we can trust with a given level of integrity.