Taking the human factor into account when setting SIL requirements

A well-known fact from accident investigations is that the human factor plays a huge role. In many large accidents, the enquiry will mention organizational factors, leadership focus, procedures and training as important factors in a complex picture involving both human factors and technological factors. In the oil and gas industry it has been found that more than half of the gas leaks detected offshore are down to human factors and errors made during operation, maintenance or startup. On the other hand – humans may also play the role of the safeguard – an operator may choose to shut down a unit behaving suspiciously prior to any dangerous situation occurring, a vehicle driver may slow down to avoid relying heavily on the ABS system for braking on icy roads, an electrician suggests to exchange a discolored socket that otherwise is well-functioning. All of these are human actions that lower the risk. The human thus always comes into the risk picture and can both enhance the safety, and threaten the safety of an asset. This all depends on leadership, training, organizational maturity and attitudes. How do we deal with this in the context of safety integrity levels?

There are many practices. There are thorough methodologies for analysis of human performance as part of barrier systems available, such as human reliability analysis (HRA), developed first in the nuclear industry but now also commonplace in many sectors (petroleum, chemical industry, aviation and transport). On the other side there are the extremes of assuming “humans always fail to do the correct thing” or “humans always do the right thing”. When performing a SIL allocation analysis using typical methods for this such as layers of protection analysis or RiskGraph (both described in IEC 61511), an important thing to consider is: can the bad things be avoided by human intervention? In many cases humans can intervene, and then we do need to have a notion of how reliable the human is. Human performance is influenced by many factors, and these factors are analyzed in depth in the framework of HRA. During a LOPA very detailed analysis of the human contribution is usually not within the scope, and a more simple approach is taken. However, there are some important questions we can bring from the HRA toolbox that will help us build more trust into the numbers we use in the LOPA, or the trust we put in this barrier element in the RiskGraph:

  • Is the operator well-trained and is the task easy to understand?
  • Does the operator have the necessary experience?
  • Does the organization have a positive safety culture?
  • Are there many tasks to handle at once and no clear priorities?
  • Is the situation stressful?
  • Does the operator have time to comprehend the situation, analyze the next action and execute before it is too late?

In many cases the operator will be well-trained exactly for the accident scenarios in question. Also, if designed correctly, there will be clear alarm prioritization and helpful messages from the alarm system – but it is always good to challenge this because quality of alarm design is varying a lot in practice. The situation is almost always stressful if the consequence of the accident is grave and there is some confusion to the situation but training can do wonders in handling such situations by resorting to reflex operating steps – think of basic training of field skills in the military. The last question is always important – does the operator have enough time? What enough time is, can also be hard to give a fixed limit on; for simple situations it is maybe sufficient with 10-15 minutes, whereas for more complex situations maybe a full hour would be needed for human intervention to be a trustworthy barrier element. Companies may have different guidelines regarding these factors – it should always be considered if these guidelines are in line with current knowledge of human performance. No shorter reaction times than 15 minutes should be allowed in analysis if credit is given to the operator. For unusual scenarios, such as the case is for “low-demand” safety functions, a PFD of the human intervention lower than 10% should not be used.

Giving credit to human intervention in SIL allocation is good practice – but the credit given should be realistic based on what we know about how humans react in these situations. Due to the large uncertainty, especially when performing a “quick-and-dirty” shortcut analysis such as discussed above, conservative values for human error should be assumed.

Also note that when a human action is included as an “independent protection layer” in a LOPA, the integrity of the entire barrier system includes this action as well. This means that in order to have control over barrier integrity, the company must carefully manage the underlying factors such as organizational maturity, safety leadership and competence management. Increased attention also to these factors in internal hazard reviews could lead to improved safety performance; maybe could the number of accidents with human errors as root cause be significantly reduced through more structured inclusion of human elements in barrier management thinking.

How independent should your FSA leader be?

Functional safety assessment is a mandatory 3rd party review/audit for functional safety work, and is required by most reliability standards. In line with good auditing practice, the FSA leader should be independent of the project development. Exactly what does this mean? Practice varies from company to company, from sector to sector and even from project to project. It seems reasonable to require a greater degree of independence for projects where the risks managed through the SIS are more serious. IEC 61511 requires (Clause 5.2.6.1.2) that functional safety assessments are conducted with “at least one senior competent person not involved in the project design team”. In a note to this clause the standard remarks that the planner should consider the independence of the assessment team (among other things). This is hardly conclusive.

If we go to the mother standard IEC 61508, requirements are slightly more explicit, as given by Clause 8.2.15 of IEC 61508-1:2010, which states that the level of independence shall be linked to perceived consequence class and required SILs. For major accident hazards, two categories are used in IEC 61508:

  • Class C: death to several people
  • Class D: very many people killed

For class C the standard accepts the use of an FSA team from “independent department”, whereas for class D only an “independent organization” is acceptable. Further, also for class C, an independent organization should be used if the degree of complexity is high, the design is novel or the design organization is lacking experience with this particular type of design. There are also requirements based on systematic capability in terms of SIL but those are normally less stringent in the context of industrial processes than the consequence based requirements to FSA team independence. The standard also specifies that compliance to sector specific standards, such as IEC 61511, would make a different basis for consideration of independence acceptable.

In this context, the definitions of “independent department” and “independent department” are given in Part 4 of the standard. An independent department is separate from and distinct from departments responsible for activities which take place during the specified phase of the overall system or software lifecycle subject to the validation activity. This means also, that the line managers of those departments should not be the same person. An independent organization is separate by management and other resources from the organizations responsible for activities that take place during the lifecycle phase. This means, in practice, that the organization leading a HAZOP or LOPA should not perform the FSA for the same project if there are potential major accident hazards within the scope, and preferably also not if there are any significant fatal accident risks in the project. Considering the requirement of separate management and resource access, it is not a non-conformity if two different legal entities within the same corporate structure perform the different activities, provided they have separate budgets and leadership teams.

If we consider another sector specific standard, EN 50129 for RAMS management in the European railway sector, we see that similar independence requirements exist for third-party validation activities. Figure 6 in that standard seemingly allows the assessor to be a part of the same organization as an organization involved in SIS development, but further requires for this situation that the assessor has an authorization from the national safety authority, is completely independent form the project team and shall report directly to the safety authorities. In practice the independent assessor is in most cases from an independent organization.

It is thus highly recommended to have an FSA team from a separate organization for all major SIS developments intended to handle serious risks to personnel; this is in line with common auditing practice in other fields.

Why is this important? Because we are all humans. If we feel ownership to a certain process, product or affiliation with an organization, it will inevitably be more difficult for us to point out what is not so good. We do not want to hurt people we work with by stating that their work is not good enough – even if we know that inferior quality in a safety instrumented system may actually lead to workers getting killed at work later. If we look to another field with the same type of challenges but potentially more guidance on independence, we can refer to the Sarbanes-Oxley act of 2002 from the United States. The SEC has issued guidelines about auditor independence and what should be assessed. Specifically they include:

  1. Will a relationship with the auditor create a mutual or conflicting interest with their audit client?
  2. Will the relationship place the auditor in the position of auditing his/her own work?
  3. Will the relationship result in the auditor acting as management or an employee of the audit client?
  4. Will the relationship result in the auditor being put in a position where he/she will act as an advocate for the audit client?

It would be prudent to consider at least these questions if considering using an organization that is already involved in the lifecycle phase subject to the FSA.

What is the demand rate on a safety function?

When we estimate the reliability of a safety instrumented function, we separate between “low demand” functions and “high demand” or “continuous demand” functions. These are all safety critical functions but their nature is different in terms of how frequently they must act on the system of study.

Consider for example the braking system on a train – the brakes need to work every time they are used – for every curve, for every station. The train driver will activate the brakes several times every hour. Obviously, this is a “high demand” system. As an example of the opposite, think of systems where we are monitoring some process and only acting if the system detects a dangerous state. A common example of this is an over-temperature trip on a heating system; if the temperature becomes too high in the system, the system shuts off power to the heater through a circuit breaker (assuming it is an electrical heater). Nobody will design a system such that its intention is to overheat, so this function would only need to activate when a specific scenario hits. Whether this is a “low demand” or “high demand” function depends on how often the function must work – and this again depends both on the “intrinsic frequency” of overheating, and other protection measures that may exist such as independent alarms or special operator training and procedures.

If you think of the stove guard installed in your kitchen that monitors overheating in the range area, what would the demand rate be? If we assume you are a 25-year old person and normally functioning as long as you are sober, you would not forget to turn off the plate on the oven more than once per year. In addition to this, you may get really drunk 10 times per year, and you cook something some of those times, with a higher probability of forgetting – say this also happens once per year – then you have an initial event rate of 2 times per year. Is this the demand rate on the stove guard function? It depends. Do you have any other measures that help you reduce the fire risk?

Typically you would have a smoke detector with alarm, and possibly the stove guard would also give you a pre-alarm. The smoke detector would be completely independent from the stove guard – and if it goes off you would react to it. This takes down the demand on the stove guard if you look at it solely as a way to stop fires from occurring (smoke comes before fire). We now assume that the smoke alarm works 1 out of 10 times and that you or a (sober) friend would always react correctly in this situation – it is normally easy to identify the smoke coming from the kitchen. Then we have reduced the demand on the stove guard to 2 x 0.1 = 0.2 times per year. This is safe to put in the bracket “low demand”. We did not count the pre-alarm on the stove guard itself on purpose, because it can have common cause failures with the core functionality of the stove guard – if one fails, the other one fails too.

The next natural question to ask in this connection is: “how reliable must the stove guard be”? We may conservatively assume that every 10th time there is a real demand on the guard and it does fail there will be a fire that can kill you and destroy the house. This risk is quite severe, and say you would only allow your house on average to burn down every 10.000 years, statistically speaking. This is your “acceptance criterion”. That is, you accept 0.0001 fires per year due to this potential source. We know the demand is 0.2 times per year – what is the allowable probability of failure on demand for the stove guard? This would be 0.0001 / 0.02 = 0.005. This means that we should require the system to have SIL 2 performance with a PFD of minimum 0.005 with this acceptance criterion, if we have a system developed in accordance with IEC 61508.

As a side note – the Norwegian research institute SINTEF has tested some stove guards. They tested 7 different types, and concluded that only 3 of them worked well. The reliability of the devices also depend on installation (location of sensors). This means that close to SIL 3 performance seems unreasonable to expect for the solutions on the market today. The SINTEF report is found at the Norwegian Directorate for Civil Protection’s website.

Electrical isolation of ignition sources on offshore installations

triangle-of-fireOne of the typical major accident scenarios considered when building and operating an offshore drilling or production rig, is a gas leak that is ignited, leading to a jet fire, or even worse, an explosion. For this scenario to happen we need three things (from the fire triangle):

  • Flammable material (the gas)
  • Oxygen (air)
  • An ignition source

The primary protection against such accidents is containment of flammable materials; avoiding leaks is the top safety priority offshore. As a lot of this equipment exists outdoors (or in “naturally ventilated areas” as standards tend to call it), it is not an option to remove “air”. Removing the ignition source hence becomes very important, in the event that you have a gas leak. The technical system used to achieve this consists of a large number of distributed gas detectors on the installation, sending message of detected gas to a controller, which then sends a signal to shut down all potential ignition sources (ie non-EX certified equipment, see the ATEX directive for details).

This being the barrier between “not much happening” and “a major disaster”, the reliability of this ignition source control is very important. Ignition sources are normally electrical systems not designed specifically to avoid ignition (so-called EX certified equipment). In order to have sufficient reliability of this set-up the number of ignition sources should be kept at a minimum; this means that the non-EX equipment should be grouped in distribution boards such that an incomer breaker can be used to isolate the whole group, instead of doing it at the individual consumer level. This is much more reliable, as the probability of a failure on demand (PFD) will contain an additive term for each of the breakers included:

PFD = PFD(Detector) + PFD(Logic) + Sum of PFD of each breaker

Consider a situation where you have 100 consumers, and the dangerous undetected failure rate for the breakers used is 10-7 failures per hour of operation, with testing every 24 months, the contribution from a single breaker is

PFD(Breaker) = 10-7 x (8760 x 2) / 2 = 0.000876

If we then have 6 breakers that need to open for full isolation, we have a breaker PFD contribution of 0.005 from the breakers (which means that with reliable gas detectors and logic solver, a full loop can satisfy a SIL 2 requirement). If we have 100 breakers the contribution to PFD is 0.08 – and the best we can hope for is SIL 1.

Stupid resource planning in projects

 

Most projects are supposed to work like an S-curve; first there is a slow kick-off period where people are familiarizing with the scope, then there is the production period with a high workload and steady progress, and then there is the finalization period with lower workload but that takes time due to many interfaces being involved in QA prior to delivery. Real projects, however, end up behind schedule, the project execution is intensified towards the end, in order to regain the lost time. Resource availability, however, is typically not updated to the new reality, and hence, there is need for overtime. We may call the three phases of real projects

  1. Familiarization (as planned)
  2. Waiting for input and getting delayed (not really as planned)
  3. Panic phase to delivery (really not as planned)

Al of this is illustrated in the graph below.

 

What is the effect of this sort of bad management? Overworked people, loss of quality, increased price and inevitable delays. We have seen this many times – we do we still keep on doing it? The antidote is

  1. Better follow-up of stakeholders from the beginning
  2. Allow some slack in the schedule to avoid cascading schedule impacts
  3. If adjustment of plan is needed, do not forget to adjust the manning plan as well – excessive overtime produces bad output

 

When this happens in safety management projects, it is reason to get worried; we really do not want subpar performance of the project output. So, for the safety of our people we owe them to manage resources in projects in a better way – don’t overlook the obvious.

New security requirements to safety instrumented systems in IEC 61511

IEC 61511 is undergoing revision and one of the more welcome changes is inclusion of cyber security clauses. According to a presentation held by functional safety expert Dr. Angela Summers at the Mary Kay Instrument Symposium in January 2015, the following clauses are now included in the new draft – the standard is planned issued in 2016:

  • Clause 8.2.4: Description of identified [security] threats for determination of requirements for additional risk reduction. There shall also be a description of measures taken to reduce or remove the hazards.
  • Clause 11.2.12: The SIS design shall provide the necessary resilience against the identified security risks

What does this mean for asset owners? It obviously makes it a requirement to perform a cyber security risk assessment for the safety instrumented systems (SIS). Such information asset risk assessments should, of course, be performed in any case for automation and safety systems. This, however, makes it necessary to keep security under control to obtain compliance with IEC 61511 – something that is often overlooked today, as described in this previous post. Further, when performing a security study, it is important that also human factors and organizational factors are taken into account – a good technical perimeter defense does not help if the users are not up to the task and have sufficient awareness of the security problem.

In the respect of organizational context, the new Clause 11.2.12 is particularly interesting as it will require security awareness and organizational resilience planning to be integrated into the functional safety management planning. As noted by many others, we have seen a sharp rise in attacks on SCADA systems over the past few years – these security requirements will bring the reliability and security fields together and ensure better overall risk management for important industrial assets. These benefits, however, will only be achieved if practitioners take the full weight of the new requirements on board.

Gas station’s tank monitoring systems open to cyber attacks

Darkreading.com brought news about a project to set up a free honeypot tool for monitoring attacks against gas tank monitoring systems. Researchers have found attacks against gas tank monitoring systems at several locations in the United States (read about it @darkreading). Interestingly, many of these systems for monitoring tank levels etc., are internet facing with no protection whatsoever – not even passwords. Attacks have so far only been of the cyberpunk type – changing a product’s name and the like; no intelligent attacks have been observed.

If we dwell on this situation a bit – we have to consider who would be interested in attacking gas station chains at a SCADA level? Obviously, if you can somehow halt the operation of all gas stations in a country, you do limit people’s mobility. In addition to that, you obviously harm the gas station’s business. Two of the most obvious attack motivations may thus be “sabotage against the nation as a whole” as part of a larger campaign, and pure criminal activity by using for example ransomware to halt gasoline sales until a ransom is payed. The latter would perhaps be the most likely of the two threats.

So – what should the gas stations do? Obviously, there are some technical barriers missing here when the system is completely open and facing the internet. The immediate solution would be to protect all network traffic by VPN tunneling, and to require a password for accessing the SCADA interfaces. Hopefully this will be done soon. The worrying aspect of this is that gas stations are not the only installation type with very weak security – there are many potential targets for black hats that are very easy to reach. The more connected our world becomes through integration of #IoT into our lives – the more important basic security measures become. Hopefully this will be realized not only by equipment vendors, but also by consumers.

The false sense of security people gain from firewalls

Firewalls are important to maintain security. On that, I suppose almost all of us agree. It is, however, not the final solution to the cyber security problem. First, there is the chance of bad guys pushing malware over traffic that is actually allowed through the firewall (people visiting bad web sites, for example). Then there is the chance that the firewall itself is set up in the wrong way. Finally, there is the possibility that people are bringing their horrible stuff inside the walled garden by using USB sticks, their own devices hooked up to the network, or similar. People running both IT and automation systems tend to be aware of all of these issues – and probably most users too. On the other hand, maybe not – but they should be aware of it and avoid doing obviously stupid stuff.

Then there is the oxymoron of the social engineer. For a skilled con artist it is easy to trick almost anyone by bribing them, using temptations (drugs, sex, money, fame, prestige, power, etc) or blackmailing them into helping an evil outsider. For some reason, companies tend to overlook this very human weakness in the defense layers. You normally do not find much mention of social engineering in operating policies, training and risk assessments for corporations running production critical IT systems, such as industrial control systems. Recent studies have shown that as many as 25% of people receiving phishing e-mails, actually click on links to websites with malware downloads. Tricksters are becoming more skilled – and the language in phishing e-mails has improved tremendously since the Viagra e-mail spam period of ten years ago. This can be summarized in a “tricking the dog” drawing:


Stuff that makes organizations easier to penetrate using social engineering includes:

  • Low employee loyalty due to underpay, bad working environment and psychotic bosses
  • Stressed employees and organizations in a state of constant overload
  • Lack of understanding of the production processes and what is critical
  • Insufficient confidentiality about IT infrastructure – allowing sys
  • tem to be analyzed from the outside
  • Lack of active follow-up of policies and practices such that security awareness erodes over time

In spite that this is well known – few organizations actually do something about that. The best defense against the social engineering attack vector may very well be a security awareness focus by the organization combined with efforts to create a good working environment and happy employees. That should be a win-win situation for both employees and the employer.

The good and bad sides of proof testing

Testing is an integral part of operation and maintenance of equipment with a SIL rating. Testing is necessary to ensure the achieved integrity of the safety instrumented function is actually as intended. First of all, the mere assumption of a test frequency  (hours between each proof test, τ) as a direct impact on the calculated probability of failure on demand (PFD) with a given failure rate for dangerous undetected failures, λDU :

PFD =  λDU x (τ/2)

This function is valid for calculating the average PFD for a single component. For redundant configurations things become more complicated but let us stick to this one for the sake of simplicity. Obviously, if we cut the number of hours between tests in half, we cut the PFD in half. So, the more often we test, the better it is – right? No – Wrong!

Why is that wrong? Testing has two negative sides:

  1. It stops production, which means it stops cash flow, which means it costs money there and then
  2. It is a source of errors itself, either through increased wear on the system or more likely, a possibility of human error like forgetting to put a system back into automatic mode after testing is done

Of course, this does not mean that we should not test – that is absolutely necessary to make sure the safety function works. Also, over time we can use the results from testing of our functions in the SIS to check whether the assumed failure rates are correct. What it means is – we need to find the right balance between the good side and bad side of testing. In practice, annual testing is often used, and this may be a sweet spot for test frequencies? Sometimes engineers are tempted to increase the test frequency to avoid trouble with PFD numbers after they have bough inferior equipment. People working on the installations tend to strongly oppose this – and rightly so. Buy good components, and test with a reasonable frequency to minimize the impact of the bad things about testing.

Operating systems and safety critical software – does the OS matter?

Safety critical software must be developed in accordance with certain practices, using specific tools fit for the required level of reliabliity, as well as by an organization with the right competence and maturity for developing such software. These software components are often parts of barrier management systems; bugs may lead to removal of critical functionality that again can lead to an accident situation with physical consequences, such as death, injuries, release of pollutants to the environment and severe material damage.

It is therefore a key question whether such software should be able to run under consumer grade operating systems, not conforming to any reliability development practices? The primary argument why this should be allowed from some vendors is “proven in use”; that they have used the software under said operating system for so many operating hours without incident, such that they feel the system is safe as borne out of experience.

It immedately seems reasonable to put more trust in a system set up that has been field testet over time and shown the expected performance, than a system that has not been tested in the field. Most operarting systems are issued with known bugs in addition to unknown bugs, and a list of bugs will exist for patching. A prioritzation of criticality is made, and the patches are developed accordingly. For Linux systems this patching strategy may be somewhat less organized as development is more distributed and less managed; even a the kernel level. The problem is akin to the classical software security problem; if software with design flaws and bugs is released, any such flaws will be found when a vulnerability is found by externals, or an incident occurs that can show the flaw. The bug or flaw is always inherent in code, and typically stems from lack of good practices during design and code development. In theory, damage resulting from such bugs and flaws shall then be limited by patching the system. In the meantime it is thought that perimeter defences cancounteract the risk of a vulnerability being exploited (this argument may not even hold in the security situation). For bugs affecting the safety in the underlying system, this thinking is flawed because even a single accident may have unacceptable consequences – including loss of human life.

In reliability engineering it is disputed whether a “workflow oriented” or “reliability growth oriented” view on software development and reliability is the most fruitful. Both have their merits. The “ship and patch” thinking inherent in proven in use cases for software indicate a stronger belief in reliability growth concepts. These are models that try to link the number of faults per operating time of the software to duration of discovery period ; most of them are modeled as some form of a Poisson process. It is acknowledged that this is a probabilistic model of deterministic errors inherent in the software, and the stochastic element is whether the actual software states realizing the errors are visited during software execution.

Coming back to operating systems, we see that complexity of such systems have grown very rapidly over the last decades. Looking specifically at the Linux kernel, the development has been tracked over time. The first kernel had about 10.000 lines of code ( in 1990). For the development of kernel version 2.6.29 they added almost 10.000 lines of code per day. If a reliability growth concept is going to work for such a rapid growth in complexity, 10.000 lines of code must be analyzed daily and end up as completely bug-free – and to prove that it is necessary to test every software state for those 10.000 lines of code.

Some research exists to compare effect of coding practices. Microsoft stated in 1992 that they had about 10-20 errors per 1000 lines of code prior to testing and that about 0.5 errors per 1000 lines of code would exist in shipped products.

Compliant development gives no guarantees on flaw and bug-free software. The same goes for development following good security practices – vulnerabilities may still exist. These practices have, however, been developed to minimize the number of design flaws and bugs getting into the shipped product. Structured programming techniques have been shown to produce code with less than 0.1 defects per 1000 lines of code – basically by following a workflow oriented quality regime in tandem with testing. If we assume 0.5 errors per 1000 lines of code in the Linux kernel (the kernel is not the entire OS), we have an estimated 7500 undiscovered bugs in the shipped version of Linux kernel 3.2.

An international rating for security of operating systems exist, the EAL rating. Commercial grade systems have a rating of EAL 4, where as secure RTOS’s tend to be EAL5 (semiformally designed, verified and tested).

The summary seems to be that consumer grade OS’s for life critical automation systems is not the best of ideas – which is why we don’t see too many of them.