How independent should your FSA leader be?

Functional safety assessment (FSA) is a mandatory third-party review or audit of functional safety work, required by most reliability standards. In line with good auditing practice, the FSA leader should be independent of the project development. What exactly does this mean? Practice varies from company to company, from sector to sector and even from project to project. It seems reasonable to require a greater degree of independence for projects where the risks managed through the SIS are more serious. IEC 61511 requires (Clause 5.2.6.1.2) that functional safety assessments are conducted with “at least one senior competent person not involved in the project design team”. In a note to this clause the standard remarks that the planner should consider the independence of the assessment team (among other things). This is hardly conclusive.

If we go to the mother standard IEC 61508, the requirements are slightly more explicit: Clause 8.2.15 of IEC 61508-1:2010 states that the level of independence shall be linked to the perceived consequence class and the required SILs. For major accident hazards, two consequence classes are relevant in IEC 61508:

  • Class C: death to several people
  • Class D: very many people killed

For class C the standard accepts the use of an FSA team from an “independent department”, whereas for class D only an “independent organization” is acceptable. Further, for class C an independent organization should still be used if the degree of complexity is high, the design is novel, or the design organization lacks experience with this particular type of design. There are also requirements based on systematic capability in terms of SIL, but in the context of industrial processes these are normally less stringent than the consequence-based requirements for FSA team independence. The standard also states that where a sector-specific standard such as IEC 61511 is complied with, a different basis for determining independence may be acceptable.

In this context, the definitions of “independent department” and “independent organization” are given in Part 4 of the standard. An independent department is separate and distinct from the departments responsible for the activities that take place during the specified phase of the overall, system or software lifecycle subject to the assessment; in particular, the departments should not report to the same line manager. An independent organization is separated, by management and other resources, from the organizations responsible for the activities that take place during the lifecycle phase. In practice, this means that the organization leading a HAZOP or LOPA should not perform the FSA for the same project if there are potential major accident hazards within the scope, and preferably also not if there are any significant fatal accident risks in the project. Given the requirement for separate management and resources, it is not a non-conformity if two different legal entities within the same corporate structure perform the different activities, provided they have separate budgets and leadership teams.

If we consider another sector-specific standard, EN 50129 for safety-related electronic systems in the European railway sector, we see that similar independence requirements exist for third-party assessment activities. Figure 6 in that standard seemingly allows the assessor to be part of the same organization as one involved in the SIS development, but in that situation it further requires that the assessor is authorized by the national safety authority, is completely independent from the project team, and reports directly to the safety authorities. In practice the independent assessor is in most cases from an independent organization.

It is thus highly recommended to have an FSA team from a separate organization for all major SIS developments intended to handle serious risks to personnel; this is in line with common auditing practice in other fields.

Why is this important? Because we are all human. If we feel ownership of a certain process or product, or affiliation with an organization, it will inevitably be more difficult for us to point out what is not so good. We do not want to hurt people we work with by stating that their work is not good enough – even if we know that inferior quality in a safety instrumented system may actually lead to workers getting killed later. If we look to another field with the same type of challenge but potentially more guidance on independence, we can refer to the Sarbanes-Oxley Act of 2002 in the United States. The SEC has issued guidance on auditor independence and what should be assessed, including the following questions:

  1. Will a relationship with the auditor create a mutual or conflicting interest with their audit client?
  2. Will the relationship place the auditor in the position of auditing his/her own work?
  3. Will the relationship result in the auditor acting as management or an employee of the audit client?
  4. Will the relationship result in the auditor being put in a position where he/she will act as an advocate for the audit client?

It would be prudent to consider at least these questions when considering the use of an organization that is already involved in the lifecycle phase subject to the FSA.

What is the demand rate on a safety function?

When we estimate the reliability of a safety instrumented function, we distinguish between “low demand” functions and “high demand” or “continuous” functions. These are all safety critical functions, but they differ in how frequently they must act on the system under study.

Consider for example the braking system on a train – the brakes need to work every time they are used, for every curve and for every station. The train driver will activate the brakes several times every hour. Obviously, this is a “high demand” system. As an example of the opposite, think of systems that monitor some process and act only if a dangerous state is detected. A common example is an over-temperature trip on a heating system: if the temperature becomes too high, the system shuts off power to the heater through a circuit breaker (assuming it is an electrical heater). Nobody designs a system with the intention that it should overheat, so this function only needs to act when a specific scenario occurs. Whether it is a “low demand” or “high demand” function depends on how often it must work – and this again depends both on the “intrinsic frequency” of overheating and on other protection measures that may exist, such as independent alarms or special operator training and procedures.

If you think of the stove guard installed in your kitchen to monitor overheating in the range area, what would the demand rate be? If we assume you are a 25-year-old who functions normally as long as you are sober, you would not forget to turn off a plate on the stove more than once per year. In addition, you may get really drunk 10 times per year and cook something on some of those occasions, with a higher probability of forgetting – say this also leads to one forgotten plate per year. You then have an initiating event rate of 2 times per year. Is this the demand rate on the stove guard function? It depends. Do you have any other measures that help you reduce the fire risk?

Typically you would have a smoke detector with an alarm, and possibly the stove guard would also give you a pre-alarm. The smoke detector is completely independent from the stove guard – and if it goes off you would react to it. This reduces the demand on the stove guard if you look at it solely as a way to stop fires from starting (smoke comes before fire). If we now assume that the smoke alarm fails only 1 out of 10 times, and that you or a (sober) friend would always react correctly in this situation – it is normally easy to identify smoke coming from the kitchen – then we have reduced the demand on the stove guard to 2 x 0.1 = 0.2 times per year. This is safely in the “low demand” bracket. We deliberately did not take credit for the pre-alarm on the stove guard itself, because it can have common cause failures with the core functionality of the stove guard – if one fails, the other may fail too.

The next natural question to ask in this connection is: “how reliable must the stove guard be?” We may conservatively assume that every 10th time there is a real demand on the guard and it fails, there will be a fire that can kill you and destroy the house. This risk is quite severe, so say you would only accept your house burning down, statistically speaking, once every 10,000 years. This is your “acceptance criterion”: you accept 0.0001 fires per year from this source. We know the demand rate is 0.2 per year – what is the allowable probability of failure on demand (PFD) for the stove guard? It is 0.0001 / (0.2 x 0.1) = 0.005. With this acceptance criterion we should therefore require SIL 2 performance, with a PFD of at most 0.005, for a system developed in accordance with IEC 61508.
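To make the arithmetic explicit, here is a small Python sketch of the same calculation. All numbers are the illustrative assumptions from the paragraphs above (forgetting rates, smoke alarm reliability, acceptance criterion), not data about any real stove guard.

# Sketch of the stove guard calculation above. All numbers are the
# illustrative assumptions from the text, not measured data.

forget_sober_per_year = 1.0    # forgotten plate while sober, per year
forget_drunk_per_year = 1.0    # forgotten plate after a night out, per year
initiating_rate = forget_sober_per_year + forget_drunk_per_year   # 2 per year

p_smoke_alarm_fails = 0.1      # independent smoke alarm fails 1 in 10 times
demand_rate = initiating_rate * p_smoke_alarm_fails               # 0.2 per year

p_fire_given_failure = 0.1     # 1 in 10 failed demands leads to a fire
acceptable_fire_freq = 1.0e-4  # accepted: one fire per 10,000 years

# acceptable_fire_freq = demand_rate * p_fire_given_failure * PFD
required_pfd = acceptable_fire_freq / (demand_rate * p_fire_given_failure)

print(f"Demand rate on the stove guard: {demand_rate} per year")
print(f"Required PFD: {required_pfd:.4f}")  # 0.0050, i.e. within the SIL 2 band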

As a side note, the Norwegian research institute SINTEF has tested a number of stove guards. They tested 7 different types and concluded that only 3 of them worked well. The reliability of the devices also depends on the installation (location of sensors). This means that expecting close to SIL 3 performance from the solutions on the market today seems unreasonable. The SINTEF report can be found on the Norwegian Directorate for Civil Protection’s website.

Electrical isolation of ignition sources on offshore installations

One of the typical major accident scenarios considered when building and operating an offshore drilling or production rig is a gas leak that is ignited, leading to a jet fire or, even worse, an explosion. For this scenario to happen we need three things (from the fire triangle):

  • Flammable material (the gas)
  • Oxygen (air)
  • An ignition source

The primary protection against such accidents is containment of the flammable materials; avoiding leaks is the top safety priority offshore. Since much of this equipment is outdoors (or in “naturally ventilated areas”, as the standards tend to call it), removing the “air” is not an option. Removing the ignition source therefore becomes very important in the event of a gas leak. The technical system used to achieve this consists of a large number of gas detectors distributed across the installation, which signal detected gas to a controller, which in turn sends a signal to shut down all potential ignition sources (i.e. equipment that is not Ex certified; see the ATEX directive for details).

As this is the barrier between “not much happening” and “a major disaster”, the reliability of this ignition source control is very important. Ignition sources are normally electrical systems not designed specifically to avoid ignition (i.e. equipment that is not Ex certified). In order to achieve sufficient reliability, the number of breakers needed to isolate them should be kept to a minimum; this means that the non-Ex equipment should be grouped in distribution boards such that an incomer breaker can isolate the whole group, instead of isolating each individual consumer. This is much more reliable, as the probability of failure on demand (PFD) will contain an additive term for each of the breakers included:

PFD = PFD(Detector) + PFD(Logic) + Sum of PFD of each breaker

Consider a situation where you have 100 consumers, and the dangerous undetected failure rate of the breakers used is 10⁻⁷ failures per hour of operation, with proof testing every 24 months. The contribution from a single breaker is then

PFD(Breaker) = 10⁻⁷ x (8760 x 2) / 2 = 0.000876

If we then have 6 breakers that need to open for full isolation, the breaker contribution to the PFD is about 0.005 (which means that with reliable gas detectors and logic solver, a full loop can still satisfy a SIL 2 requirement). If we have 100 breakers, the contribution to the PFD is about 0.09 – and the best we can hope for is SIL 1.
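A short Python sketch of this calculation is shown below. The failure rate, the test interval and the simple λ·τ/2 low-demand approximation are the assumptions stated above, so the numbers are only indicative.

# Sketch of the breaker contribution to the isolation function's PFD.
# The failure rate and test interval are the assumptions from the text.

lambda_du = 1.0e-7               # dangerous undetected failures per hour
test_interval_hours = 2 * 8760   # proof test every 24 months

# Simple low-demand approximation for one periodically tested breaker
pfd_single_breaker = lambda_du * test_interval_hours / 2   # ~0.000876

for n_breakers in (6, 100):
    pfd_breakers = n_breakers * pfd_single_breaker
    print(f"{n_breakers:3d} breakers -> PFD contribution {pfd_breakers:.3f}")
# 6 breakers   -> ~0.005 (a full loop can still meet SIL 2 with good detectors and logic)
# 100 breakers -> ~0.088 (the breaker term alone limits the loop to SIL 1)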

Stupid resource planning in projects


Most projects are supposed to follow an S-curve: first there is a slow kick-off period where people familiarize themselves with the scope, then a production period with high workload and steady progress, and then a finalization period with lower workload that still takes time because many interfaces are involved in QA prior to delivery. Real projects, however, end up behind schedule, and project execution is intensified towards the end in order to regain the lost time. Resource availability, however, is typically not updated to the new reality, and hence there is a need for overtime. We may call the three phases of real projects:

  1. Familiarization (as planned)
  2. Waiting for input and getting delayed (not really as planned)
  3. Panic phase to delivery (really not as planned)

All of this is illustrated in the graph below.


What is the effect of this sort of bad management? Overworked people, loss of quality, increased cost and inevitable delays. We have seen this many times – why do we keep on doing it? The antidote is:

  1. Better follow-up of stakeholders from the beginning
  2. Allow some slack in the schedule to avoid cascading schedule impacts
  3. If adjustment of plan is needed, do not forget to adjust the manning plan as well – excessive overtime produces bad output


When this happens in safety management projects, there is reason to worry; we really do not want subpar performance of the project output. So, for the safety of our people, we owe it to them to manage project resources better – don’t overlook the obvious.

Is the necessary SIL related to layers of protection or operating practices?

A safety integrity level is a quantification of the risk reduction we need from an automated safety system to achieve an acceptable risk level for some industrial system. The necessary risk reduction obviously also depends on the other activities and systems we put in place to reduce risk from its “intrinsic” level. The following drawing illustrates how different measures work together to achieve acceptable risk for a technical asset.

Figure showing how risk reducing measures work together to bring the risk down to an acceptable level.

Consider for example a steel tank that is filled with pressurized gas. One potential hazard here is overpressure in the tank, which may cause a leak, and the gas can be both toxic and flammable – obviously a dangerous situation. When working with risk, we need to define what we mean by acceptable risk in terms of “acceptance criteria”. In this case, we may say that we accept an explosion due to a gas leak that is subsequently ignited once every one million years – that is, a frequency of 10⁻⁶ per year.

The initiating frequency is maybe 0.1 per year, if the source of the high pressure is a controller intended to keep the pressure steady over time by adjusting a valve; as a coarse rule of thumb, such process control loops have one malfunction every 10 years. A passive technology here can be a spring-loaded safety valve that opens on high pressure and lets the gas out to a safe location, for example a flare system where the gas can be burnt off in a controlled manner. This reduces the frequency by 99% (such a passive valve tends to fail no more often than 1 out of 100 times). In addition, there is an independent alarm on the tank, giving a message to an operator in a control room that the pressure is increasing; the operator has time to go and check what is going on and shut off the gas supply to the tank by closing a manual valve. How reliable is this operator? With sufficient time, and allowing for some confusion due to stress, we may claim that the operator manages to intervene 9 out of 10 times (such numbers can be found through human reliability analysis – a technique for assessing the performance of trained people under various conditions, developed primarily within the nuclear industry). Finally, a terrible explosion does not automatically follow from a leak – something needs to ignite the gas. Depending on the ignition sources present we can assign a probability to this (models exist); for this case, let us assume the probability of igniting a gas cloud in this location is 10%.

Together these layers reduce the frequency by a factor of 100 x 10 x 10 = 10,000 from the initiating frequency of 0.1 per year. The frequency of such explosions due to a leak in the tank, before applying any automatic shutdown system, is thus 0.1 x 0.0001 = 0.00001 = 10⁻⁵ per year. The remaining reduction needed to bring the frequency down to 1 in a million years is a factor of 10, i.e. an automated shutdown function that does not fail more than 1 out of 10 demands – a PFD of 0.1. This means we need a safety instrumented function with a probability of failure on demand of 0.1, which corresponds to a SIL 1 requirement. The process we used to deduce this number is, by the way, known as a LOPA – a layers of protection analysis. The LOPA is one of many tools in the engineer’s toolbox for performing risk assessments.

What this illustrates is that the requirement for an automated shutdown function depends on the other risk mitigation efforts – and the reliability of those barrier elements. What if the operator does not have time to intervene, or cannot be trusted? If we take away the credit for the operator’s actions, the mitigated frequency becomes 10⁻⁴ per year, and we see immediately that we need a SIL 2 function (PFD of 0.01) to achieve an acceptable level of safety.
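The tank example can be reproduced with a few lines of Python. The frequencies and probabilities are the illustrative values assumed above, and the simple multiplication of layer PFDs presumes the layers are independent of each other.

# Sketch of the tank LOPA above. Frequencies and probabilities are the
# illustrative assumptions from the text; layers are treated as independent.

initiating_freq = 0.1       # control loop malfunction, per year
pfd_relief_valve = 0.01     # spring-loaded safety valve fails 1 in 100 demands
pfd_operator = 0.1          # operator fails to intervene 1 in 10 times
p_ignition = 0.1            # probability that a leaking gas cloud ignites
target_freq = 1.0e-6        # accepted explosion frequency, per year

def required_sif_pfd(layer_pfds):
    """Mitigated frequency without the SIF, and the PFD the SIF must deliver."""
    mitigated = initiating_freq
    for pfd in layer_pfds:
        mitigated *= pfd
    return mitigated, target_freq / mitigated

cases = {
    "with operator": [pfd_relief_valve, pfd_operator, p_ignition],
    "without operator": [pfd_relief_valve, p_ignition],
}
for name, layers in cases.items():
    mitigated, sif_pfd = required_sif_pfd(layers)
    print(f"{name}: {mitigated:.0e}/year before SIF, required SIF PFD {sif_pfd:.2f}")
# with operator:    1e-05/year -> PFD 0.10 -> SIL 1
# without operator: 1e-04/year -> PFD 0.01 -> SIL 2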

What does a “SIL” requirement really mean?

Safety instrumented systems are often assigned a “Safety Integrity Level”. This is an important concept for ensuring that automatic controls intended to maintain the safety of a technical asset actually bring the risk reduction that is necessary. In the reliability standards IEC 61508 and IEC 61511, there are 4 SILs:

  • SIL 1: a failure on demand in 1 out of 10 demands is acceptable
  • SIL 2: a failure on demand in 1 out of 100 demands is acceptable
  • SIL 3: a failure on demand in 1 out of 1 000 demands is acceptable
  • SIL 4: a failure on demand in 1 out of 10 000 demands is acceptable

This way of defining the probability of failure applies to so-called “low-demand” systems. In practice that means that the safety function does not need to act more than once per year in order to stop an accident from occurring.
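As an illustration, the low-demand PFD bands can be expressed as a small helper function (band boundaries here follow the usual convention of greater than or equal to the lower limit and less than the upper limit). This covers the quantitative criterion only; as discussed next, the other requirement types must still be met.

# Helper mapping a low-demand average PFD to the SIL band it falls within,
# using the PFD ranges of IEC 61508/61511. This covers the quantitative
# criterion only; the other requirement types still apply.

def low_demand_sil(pfd: float) -> str:
    if 1e-5 <= pfd < 1e-4:
        return "SIL 4"
    if 1e-4 <= pfd < 1e-3:
        return "SIL 3"
    if 1e-3 <= pfd < 1e-2:
        return "SIL 2"
    if 1e-2 <= pfd < 1e-1:
        return "SIL 1"
    return "outside the SIL 1-4 bands"

print(low_demand_sil(0.005))  # SIL 2 (cf. the stove guard example earlier)
print(low_demand_sil(0.05))   # SIL 1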

The SIL requirement does not only involve probability calculations (PFD = probability of failure on demand). A SIL consists of four different types of requirements:

  • Quantitative requirements (PFD, defined as the probability of failure when there is a demand for the function)
  • Semi-quantitative requirements (requirements for redundancy, and for a certain fraction of possible failures leading to a safe state – the so-called safe failure fraction)
  • Software requirements (much of the actual control functionality is implemented in software; here the standards require a work-process oriented approach, with rigor increasing with the SIL)
  • Qualitative requirements (avoidance of systematic errors, quality management, etc.)

Most people focus only on the quantitative part and do not think about the latter three. In order for us to trust the probability assessment, the issues that cannot be quantified must be properly managed. Hence, to claim that you have achieved a certain SIL for your safety function, you need to document that the redundancy is right, that most failures will lead to a safe state, that your software has been developed in accordance with required practices and using acceptable technologies, and that your organization and workflows ensure sufficient quality of your safety function and the system it is part of.

If people buying components for safety instrumented systems kept this in mind, it would become much easier to create safety critical automation systems we can trust to deliver a given level of integrity.

Planning lifecycle activities for safety instrumented systems

Modern industrial safety instrumented systems are often required to be designed in accordance with IEC 61508 or IEC 61511. These functional safety standards take a lifecycle view of the safety instrumented system. Most people associate them with SIL – safety integrity levels – an important concept in these standards. Many newcomers to functional safety focus only on quantitative measures of reliability and do not engage with the lifecycle process. This leads to poorer designs than necessary, and compliance with these standards is not possible without taking the whole lifecycle into account.

A good way to look at a safety instrumented system is to define the phases of its lifecycle and then assign activities for managing the system throughout those phases. Based on IEC 61511 we can define these phases as:

  • Design
  • Construction
  • Commissioning
  • Operation and maintenance
  • Decommissioning

In other words – we need to manage the safety instrumented system from conception to grave, in line with asset management thinking in general. For each of these phases there will typically be various activities related to the safety instrumented system that we need to focus on. For example, in the design phase we need to identify the necessary risk reduction, perform risk analysis and determine the necessary SILs for the different safety instrumented functions making up the system. A key document emerging from this phase is the “Safety Requirement Specification”. Typically, in the same phase one would start to map out vendors and send out requests for offers on equipment. A guideline for vendors on what type of documentation they should provide is also good to prepare in this early phase. The Norwegian oil and gas association has made a very useful guideline (Guideline No. 070) for the application of functional safety in the oil industry; it contains a very good description of what type of documentation needs to be collected, and is a good starting point.

Also part of design, and typically lasting into the construction phase as well, are activities such as compliance assessment (it is necessary to check whether the requirements in the SRS are actually fulfilled, based on documentation from equipment vendors and system integrators). In addition, at this point it is necessary to complete a Functional Safety Assessment (FSA) – a third-party review in the form of an audit to check that the work has been done the way the standards require.

Part of the plan should cover how to commission the safety instrumented system. When are the different functions tested? What type of verification do we do on the programming of actions based on inputs? Who is responsible for this? All of this should be planned from the start.

Further, when the system is taken into operation, the complete asset (including the SIS) is handed over to the company that is going to operate it. The owner is then responsible for maintenance of the system, for proof testing and for ensuring that all barrier elements necessary for the system to work are in place. These types of activities should be planned as well.

Finally, the end of life of the asset should be managed. How to do that should also be part of the plan – taking the system out of service, as a whole or only in parts, should still be done while maintaining the right level of safety for people, the environment and other assets that may be harmed if an accident should occur.

In addition, there are a number of aspects that should be included in a plan for managing functional safety which span all of these lifecycle phases: competence management for people working with the SIS in the different phases, how to deal with changes to the system or to the environment it operates in, who is responsible for what, and how to communicate across company interfaces – this list is not exhaustive. Consult the standards for the details. A rough sketch of such a plan is shown below.
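As a rough illustration of what such a plan might look like in outline, the Python sketch below lists example activities per phase. The activities and document names are examples drawn from the discussion above, not a normative or exhaustive checklist.

# A minimal sketch of a phase-by-phase SIS activity plan, loosely following
# the phases listed above. Activities and documents are examples only.

sis_lifecycle_plan = {
    "Design": [
        "Hazard and risk assessment (e.g. HAZOP, LOPA)",
        "SIL determination per safety instrumented function",
        "Safety Requirement Specification (SRS)",
        "Vendor documentation requirements (e.g. per Guideline No. 070)",
    ],
    "Construction": [
        "Compliance assessment against the SRS",
        "Functional Safety Assessment (FSA)",
    ],
    "Commissioning": [
        "Function testing and verification of logic and programming",
        "Assignment of responsibility for each test",
    ],
    "Operation and maintenance": [
        "Proof testing programme",
        "Follow-up of barrier elements and failure reporting",
    ],
    "Decommissioning": [
        "Staged or full removal while maintaining required risk reduction",
    ],
    "Cross-phase": [
        "Competence management",
        "Management of change",
        "Roles, responsibilities and interface communication",
    ],
}

for phase, activities in sis_lifecycle_plan.items():
    print(phase)
    for activity in activities:
        print("  -", activity)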

If all organizations involved in functional safety design planned out their activities in a good way, fewer changes would occur towards the end of large engineering projects, and better quality would be obtained at a lower cost. And this is a low-hanging fruit that we all should grab.