Electrical isolation of ignition sources on offshore installations

One of the typical major accident scenarios considered when building and operating an offshore drilling or production rig is a gas leak that ignites, leading to a jet fire or, even worse, an explosion. For this scenario to happen we need three things (from the fire triangle):

  • Flammable material (the gas)
  • Oxygen (air)
  • An ignition source

The primary protection against such accidents is containment of flammable materials; avoiding leaks is the top safety priority offshore. As much of this equipment is located outdoors (or in “naturally ventilated areas”, as the standards tend to call it), removing the “air” is not an option. Removing the ignition source therefore becomes very important in the event of a gas leak. The technical system used to achieve this consists of a large number of gas detectors distributed across the installation, sending a signal of detected gas to a controller, which in turn sends a signal to shut down all potential ignition sources (i.e. non-EX certified equipment; see the ATEX directive for details).
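To make the principle concrete, here is a minimal Python sketch of the detector-to-breaker logic described above. The names (GasDetector, IncomerBreaker, IsolationController) are made up for illustration, and a real system would add detector voting, diagnostics and hardwired trip paths.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GasDetector:
    tag: str
    gas_detected: bool = False   # set by the field device

@dataclass
class IncomerBreaker:
    tag: str                     # feeds one group of non-EX consumers
    is_open: bool = False        # open breaker = group is isolated

    def trip(self) -> None:
        self.is_open = True

@dataclass
class IsolationController:
    detectors: List[GasDetector]
    breakers: List[IncomerBreaker]

    def scan(self) -> None:
        # Any detector reporting gas trips every incomer breaker,
        # removing power from all grouped potential ignition sources.
        if any(d.gas_detected for d in self.detectors):
            for breaker in self.breakers:
                breaker.trip()
```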

This being the barrier between “not much happening” and “a major disaster”, the reliability of this ignition source control is very important. Ignition sources are normally electrical systems that are not specifically designed to avoid ignition (i.e. equipment that is not EX certified). In order to achieve sufficient reliability of this set-up, the number of breakers that must open for full isolation should be kept to a minimum; this means that the non-EX equipment should be grouped in distribution boards such that an incomer breaker can be used to isolate the whole group, instead of isolating each individual consumer. This is much more reliable, as the probability of failure on demand (PFD) contains an additive term for each breaker included:

PFD = PFD(Detector) + PFD(Logic) + Sum of PFD of each breaker

Consider a situation where you have 100 consumers, and the dangerous undetected failure rate for the breakers used is 10⁻⁷ failures per hour of operation. With proof testing every 24 months, the contribution from a single breaker is

PFD(Breaker) = 10⁻⁷ × (8760 × 2) / 2 = 0.000876

If we then have 6 breakers that need to open for full isolation, we have a PFD contribution of about 0.005 from the breakers (which means that with reliable gas detectors and logic solver, a full loop can satisfy a SIL 2 requirement). If we have 100 breakers the contribution to the PFD is roughly 0.09 – and the best we can hope for is SIL 1.
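The arithmetic above is simple enough to sketch in a few lines of Python; the function name pfd_breaker and the constants are just illustrative, and the formula is the simple lambda_DU × tau / 2 approximation used in the example.

```python
def pfd_breaker(lambda_du: float, test_interval_hours: float) -> float:
    """Average PFD of one breaker: lambda_DU * tau / 2 (simple approximation)."""
    return lambda_du * test_interval_hours / 2

LAMBDA_DU = 1e-7            # dangerous undetected failures per hour
TEST_INTERVAL = 2 * 8760    # proof test every 24 months, in hours

single = pfd_breaker(LAMBDA_DU, TEST_INTERVAL)
print(f"one breaker:  {single:.6f}")            # 0.000876

for n in (6, 100):
    # Breaker contribution to the loop PFD; detector and logic solver
    # contributions would be added on top of this.
    print(f"{n:3d} breakers: {n * single:.4f}")  # 0.0053 and 0.0876
```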

What is the difference between software and hardware failures in a reliability context?

Reliability engineers have traditionally focused more on hardware than software. There are many reasons for this; one is that safety systems were traditionally based on analog electronics, and although digital controls and PLCs were introduced throughout the 1990s, the software involved was initially very simple. Today the situation has really changed, but reliability practice has not completely taken this onboard. One of the reasons may be that reliability experts like to calculate probabilities – which they are very good at doing for hardware failures. Hardware failures tend to be random and can be modeled quite well using probabilistic tools. So – what about software? The failure mechanisms are very different: whereas hardware failures stem from more or less stochastic effects such as load cycling, material defects and ageing, software defects are completely deterministic (we disregard stochastic algorithms here – they are banned from use in safety critical control systems anyway).

Software defects exist for two reasons: design errors (flaws) and implementation errors (bugs). These errors may occur at the requirements stage or during actual coding, but irrespective of when they occur, they are always static. They do not appear suddenly – they are latent errors hidden within the code that will activate each and every time the software state where the error is relevant is visited.
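A trivial, made-up example illustrates the point: the defect below is latent and completely deterministic. It does nothing until the offending state is visited, and then it fails every single time that state occurs.

```python
def scale_reading(raw_counts: int, span_counts: int) -> float:
    # Latent flaw: the requirements never considered span_counts == 0.
    return raw_counts / span_counts

scale_reading(512, 1024)   # behaves correctly, possibly for years
scale_reading(512, 0)      # fails deterministically on every visit to this state
```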

Such errors are very difficult to include in a probabilistic model. That is why reliability standards prescribe a completely different medicine: a process-oriented framework that puts requirements on management, the choice of methods and tools, as well as testing and documentation. These quality-directed workflows and requirements are put in place so that we can have some confidence that the software is not a significant source of unsafe failures of the critical control system.

Hence – process verification and auditing take the place of probability calculations when we look at the software. In order to achieve the desired level of trust it is very important that these practices are not neglected in the functional safety work. Deterministic errors may be just as catastrophic as random ones – and therefore they must be managed with just as much rigor and care. The current trend is that more and more functionality is moved from hardware to software – which means that software errors are becoming increasingly important to manage correctly if we are not going to degrade both the performance of, and trust in, the safety instrumented systems we rely on to protect our lives, assets and the environment.

Does safety engineering require security engineering?

Safety critical control systems are developed with respect to reliability requirements, often following a reliability standard such as IEC 61508 or CENELEC EN 50128. These standards put requirements on development practices and activities aimed at creating software that works the way it is intended for the expected inputs, and where availability and integrity are of paramount importance. However, these standards do not address information security. Some of the practices required by reliability standards do help in removing bugs and design flaws – which to a large extent also removes security vulnerabilities – but they do not explicitly express such concerns. Reliability engineering is about building trust in the intended functionality of the system. Security is about the absence of unintended functionality.

Consider a typical safety critical system installed in an industrial process, such as an overpressure protection system. Such a system may consist of a pressure transmitter, a logic unit (i.e. a computer) and some final elements. This simple system measures the pressure and transmits it to the computer, typically over a hardwired analog connection. The computer then decides whether the system is within a safe operating region or above a set point for stopping operation. If we are in the unsafe region, the computer tells the final element to trip the process, for example by flipping an electrical circuit breaker or closing a valve. Reliability standards that include software development requirements focus on how development must be carried out to ensure that whenever the sensor transmits a pressure above the threshold, the computer will tell the process to stop. Further, the computer is connected over a network to an engineering station, which is used for such things as updating the algorithm in the control system, changing the threshold limits, etc.
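The core trip logic of such a function is tiny. The sketch below, with an assumed set point of 50 barg and made-up callback names, only illustrates the intended behaviour that the reliability standards are concerned with.

```python
TRIP_SETPOINT_BARG = 50.0   # assumed high-pressure trip limit (illustrative)

def should_trip(pressure_barg: float) -> bool:
    """True when the measured pressure is at or above the trip set point."""
    return pressure_barg >= TRIP_SETPOINT_BARG

def control_cycle(read_pressure, trip_final_element) -> None:
    # read_pressure(): current value from the hardwired analog transmitter
    # trip_final_element(): open the breaker or close the valve
    if should_trip(read_pressure()):
        trip_final_element()
```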

What if someone wants to put the system out of order, without anyone noticing? The software’s access control would be a crucial barrier against anyone tampering with the functionality. Reliability standards do not talk about how to actually avoid weak authentication schemes, although they talk about access management in general. You may very well be compliant with the reliability standard – yet have very weak protection against compromising the access control. For example, the coder may very well use a “getuser()” call in C in the authentication part of the software – without violating the reliability standard requirements. This is a very insecure way of getting user credentials from the computer and should generally be avoided. If such a practice is used, a hacker with access to the network could with relative ease get admin access to the system and change, for example, set points, or worse, recalibrate the pressure sensor to report wrong readings – something that was actually done in the Stuxnet case.
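A Python analogue of the same weakness may make this clearer: getpass.getuser() simply reads the login name from environment variables, so anyone who controls the environment controls the “authentication”. The account name below is made up for illustration.

```python
import getpass

ADMIN_ACCOUNTS = {"sis_admin"}   # illustrative account name

def is_admin_weak() -> bool:
    # Weak: trusts an attacker-controllable environment value as proof of identity.
    return getpass.getuser() in ADMIN_ACCOUNTS

def is_admin(verify_credentials) -> bool:
    # Better: demand a real, verifiable credential (password, token, certificate)
    # checked by verify_credentials() instead of an environment lookup.
    return verify_credentials()
```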

In other words – as long as someone might be interested in harming your operation, your safety system needs security built in, and that does not come for free through reliability engineering. And there is always someone out to get you – for sport, for money or just because they do not like you. Managing security is an important part of managing your business risk – so do not neglect this issue while worrying only about the reliability of intended functionality.

Planning lifecycle activities for safety instrumented systems

Modern industrial safety instrumented systems are often required to be designed in accordance with IEC 61508 or IEC 61511. These functional safety standards take a lifecycle view of the safety instrumented system. Most people associate them with SIL – safety integrity levels – which is an important concept in these standards. Many newcomers to functional safety focus only on quantitative measures of reliability and do not engage with the lifecycle process. This leads to poorer designs than necessary, and compliance with the requirements of these standards is not possible without taking the whole lifecycle into account.

A good way to look at a safety instrumented system, is to define phases of the lifecycle, and then assign activities for managing the safety instrumented system throughout these phases. Based on IEC 61511 we can define these phases as:

  • Design
  • Construction
  • Commissioning
  • Operation and maintenance
  • Decommissioning

In other words – we need to manage the safety instrumented system from conception to grave – in line with asset management thinking in general. For each of these phases there will typically be various activities related to the safety instrumented system that we need to focus on. For example, in the design phase we need to focus on identifying the necessary risk reduction, performing risk analysis and determining the necessary SILs for the different safety instrumented functions making up the system. A key document emerging from this phase is the “Safety Requirements Specification” (SRS). Typically, in the same phase one would start to map out vendors and put out requests for offers on equipment to buy. A guideline for vendors on what type of documentation they should provide would also be good to prepare in this early phase. The Norwegian oil and gas association has made a very nice guideline (Guideline No. 070) for the application of functional safety in the oil industry; this guideline contains a very good description of what type of documentation needs to be collected. This is a good starting point.

Also part of design, and typically lasting into the construction phase as well, we would find activities such as compliance assessment (it is necessary to check whether the requirements in the SRS are actually fulfilled, based on documentation from equipment vendors and system integrators). In addition, at this point it is necessary to complete a Functional Safety Assessment (FSA), a third-party review in the form of an audit to check that the work has been done the way the standards require.

Part of the plan should cover how to commission the safety instrumented system. When are the different functions tested? What type of verification do we perform on the programming of actions based on inputs? Who is responsible for this? All of this should be planned out from the start.

Further, when the system is taken into operation, the complete asset (including the SIS) is delivered to the company that is going to operate it. The owner is then responsible for maintenance of the system, for proof testing, and for ensuring that all barrier elements necessary for the system to work are in place. These types of activities should be planned as well.

Finally, the end-of-life of the asset should be managed. How to actually manage that should be part of the plan – taking the system out of service, as a whole or only in parts, should still be done while maintaining the right level of safety for people, the environment and other assets that may be harmed if an accident should occur.
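One simple way to keep such a plan explicit is to record the activities per phase. The mapping below is just a sketch in Python based on the examples in this text, not a complete list from the standards.

```python
# Sketch of a phase-to-activities mapping for the SIS lifecycle plan,
# based only on the examples mentioned in the text above.
LIFECYCLE_PLAN = {
    "Design": [
        "Identify necessary risk reduction and perform risk analysis",
        "Determine SIL per safety instrumented function",
        "Produce the Safety Requirements Specification (SRS)",
        "Map vendors and define documentation requirements (e.g. Guideline No. 070)",
    ],
    "Construction": [
        "Compliance assessment against the SRS",
        "Functional Safety Assessment (FSA)",
    ],
    "Commissioning": [
        "Function testing and verification of programmed actions",
        "Assign responsibilities for testing",
    ],
    "Operation and maintenance": [
        "Proof testing, maintenance and barrier element follow-up",
    ],
    "Decommissioning": [
        "Take the system out of service (wholly or in parts) while maintaining safety",
    ],
}
```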

In addition, there are a number of aspects that should be included in a plan for managing functional safety and that span all these lifecycle phases. These are things like competence management of the people working with the SIS in the different lifecycle phases, how to deal with changes to the system or the environment it operates in, who is responsible for what, and how to communicate across company interfaces – this list is not exhaustive. Consult the standards for the details.

If all organizations involved in functional safety design planned out their activities well, fewer changes would occur towards the end of large engineering projects, and better quality would be obtained at a lower cost. And this is a low-hanging fruit that we should all grab.