What is the demand rate on a safety function?

When we estimate the reliability of a safety instrumented function, we distinguish between “low demand” functions and “high demand” or “continuous demand” functions. These are all safety critical functions, but they differ in how frequently they must act on the system under study.

Consider for example the braking system on a train – the brakes need to work every time they are used – for every curve, for every station. The train driver will activate the brakes several times every hour. Obviously, this is a “high demand” system. As an example of the opposite, think of systems that monitor some process and only act if a dangerous state is detected. A common example is an over-temperature trip on a heating system: if the temperature becomes too high, the system shuts off power to the heater through a circuit breaker (assuming it is an electrical heater). Nobody designs a system with the intention that it should overheat, so this function only needs to activate in specific scenarios. Whether it is a “low demand” or “high demand” function depends on how often the function must work – and this again depends both on the “intrinsic frequency” of overheating and on other protection measures that may exist, such as independent alarms or special operator training and procedures.

If you think of the stove guard installed in your kitchen to monitor overheating in the range area, what would the demand rate be? If we assume you are a normally functioning 25-year-old as long as you are sober, you would not forget to turn off a plate on the stove more than once per year. In addition to this, you may get really drunk 10 times per year, and you cook something on some of those occasions, with a higher probability of forgetting – say this also happens once per year. You then have an initiating event rate of 2 times per year. Is this the demand rate on the stove guard function? It depends. Do you have any other measures that help you reduce the fire risk?

Typically you would have a smoke detector with an alarm, and possibly the stove guard would also give you a pre-alarm. The smoke detector is completely independent of the stove guard – and if it goes off you would react to it. This reduces the demand on the stove guard if you look at it solely as a way to stop fires from occurring (smoke comes before fire). We now assume that the smoke alarm fails only 1 out of 10 times and that you or a (sober) friend would always react correctly in this situation – it is normally easy to identify smoke coming from the kitchen. Then we have reduced the demand on the stove guard to 2 x 0.1 = 0.2 times per year. This is safe to put in the “low demand” bracket. We deliberately did not credit the pre-alarm on the stove guard itself, because it can have common cause failures with the core functionality of the stove guard – if one fails, the other one fails too.

The next natural question to ask in this connection is: “how reliable must the stove guard be?” We may conservatively assume that every 10th time there is a real demand on the guard and it does fail, there will be a fire that can kill you and destroy the house. This risk is quite severe, and say you would only allow your house to burn down on average once every 10,000 years, statistically speaking. This is your “acceptance criterion”. That is, you accept 0.0001 fires per year from this potential source. We know the demand rate is 0.2 times per year, and with one fire per 10 missed demands the fire frequency would be 0.2 x 0.1 = 0.02 per year if the guard always failed. What is the allowable probability of failure on demand for the stove guard? It is 0.0001 / 0.02 = 0.005. This means we should require the system to have SIL 2 performance with a PFD of at most 0.005 under this acceptance criterion, for a system developed in accordance with IEC 61508.
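To make the arithmetic explicit, here is a small Python sketch of the same calculation. All numbers are the illustrative assumptions from the example above (event rates, smoke alarm failure probability, fire probability and acceptance criterion), not measured data:

  # Illustrative numbers from the stove guard example – assumptions, not measurements
  forgetting_sober = 1.0        # initiating events per year when sober
  forgetting_drunk = 1.0        # initiating events per year when cooking drunk
  smoke_alarm_fails = 0.1       # fraction of events the independent smoke alarm does not stop

  demand_rate = (forgetting_sober + forgetting_drunk) * smoke_alarm_fails  # demands per year on the stove guard

  p_fire_given_missed_demand = 0.1   # conservative: every 10th missed demand leads to a fire
  acceptable_fire_rate = 1e-4        # acceptance criterion: one fire per 10,000 years

  required_pfd = acceptable_fire_rate / (demand_rate * p_fire_given_missed_demand)

  print(demand_rate)    # 0.2 per year -> clearly in the "low demand" bracket
  print(required_pfd)   # 0.005 -> within the SIL 2 band (0.001 <= PFD < 0.01)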

As a side note – the Norwegian research institute SINTEF has tested a number of stove guards. They tested 7 different types and concluded that only 3 of them worked well. The reliability of the devices also depends on the installation (location of sensors). This means that anything close to SIL 3 performance seems unreasonable to expect from the solutions on the market today. The SINTEF report can be found on the Norwegian Directorate for Civil Protection’s website.

Electrical isolation of ignition sources on offshore installations

One of the typical major accident scenarios considered when building and operating an offshore drilling or production rig is a gas leak that is ignited, leading to a jet fire or, even worse, an explosion. For this scenario to happen we need three things (from the fire triangle):

  • Flammable material (the gas)
  • Oxygen (air)
  • An ignition source

The primary protection against such accidents is containment of flammable materials; avoiding leaks is the top safety priority offshore. As a lot of this equipment is located outdoors (or in “naturally ventilated areas”, as the standards tend to call them), it is not an option to remove the “air”. Removing the ignition source therefore becomes very important in the event of a gas leak. The technical system used to achieve this consists of a large number of gas detectors distributed across the installation, sending messages about detected gas to a controller, which then sends a signal to shut down all potential ignition sources (i.e. non-EX certified equipment; see the ATEX directive for details).

This being the barrier between “not much happening” and “a major disaster”, the reliability of this ignition source control is very important. Ignition sources are normally electrical systems not designed specifically to avoid ignition (i.e. not EX certified equipment). In order to have sufficient reliability of this set-up, the number of breakers that must open to achieve isolation should be kept to a minimum; this means that the non-EX equipment should be grouped in distribution boards such that an incomer breaker can be used to isolate the whole group, instead of doing it at the individual consumer level. This is much more reliable, as the probability of failure on demand (PFD) will contain an additive term for each of the breakers included:

PFD = PFD(Detector) + PFD(Logic) + Sum of PFD of each breaker

Consider a situation where you have 100 consumers and the dangerous undetected failure rate for the breakers used is 10⁻⁷ failures per hour of operation, with proof testing every 24 months. The contribution from a single breaker is then

PFD(Breaker) = 10⁻⁷ x (8760 x 2) / 2 = 0.000876

If we then have 6 breakers that need to open for full isolation, the breakers contribute about 0.005 to the PFD (which means that with reliable gas detectors and a reliable logic solver, the full loop can still satisfy a SIL 2 requirement). If we have 100 breakers, the contribution to the PFD is almost 0.09 – and the best we can hope for is SIL 1.
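The same sums can be written as a short Python sketch; the failure rate and test interval are the assumed figures stated above:

  # Breaker contribution to the PFD of the isolation function – assumed figures from the example
  lambda_du = 1e-7                 # dangerous undetected failure rate per hour
  test_interval_hours = 2 * 8760   # proof test every 24 months

  pfd_per_breaker = lambda_du * test_interval_hours / 2   # ~0.000876

  for n_breakers in (6, 100):
      print(n_breakers, "breakers ->", round(n_breakers * pfd_per_breaker, 4))
  # 6 breakers   -> ~0.0053 (a full loop can still meet SIL 2)
  # 100 breakers -> ~0.0876 (SIL 1 is the best we can hope for)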

Stupid resource planning in projects

 

Most projects are supposed to follow an S-curve: first there is a slow kick-off period where people familiarize themselves with the scope, then there is the production period with a high workload and steady progress, and then there is the finalization period with a lower workload that still takes time because many interfaces are involved in QA prior to delivery. Real projects, however, end up behind schedule, and project execution is intensified towards the end in order to regain the lost time. Resource availability, however, is typically not updated to the new reality, and hence there is a need for overtime. We may call the three phases of real projects:

  1. Familiarization (as planned)
  2. Waiting for input and getting delayed (not really as planned)
  3. Panic phase to delivery (really not as planned)

All of this is illustrated in the graph below.

 

What is the effect of this sort of bad management? Overworked people, loss of quality, increased cost and inevitable delays. We have seen this many times – why do we still keep on doing it? The antidote is:

  1. Better follow-up of stakeholders from the beginning
  2. Allow some slack in the schedule to avoid cascading schedule impacts
  3. If the plan needs adjusting, do not forget to adjust the manning plan as well – excessive overtime produces bad output

 

When this happens in safety management projects, there is reason to worry; we really do not want subpar quality in the project output. So, for the safety of our people, we owe it to them to manage project resources better – don’t overlook the obvious.

New security requirements for safety instrumented systems in IEC 61511

IEC 61511 is undergoing revision, and one of the more welcome changes is the inclusion of cyber security clauses. According to a presentation given by functional safety expert Dr. Angela Summers at the Mary Kay Instrument Symposium in January 2015, the following clauses are now included in the new draft – the standard is planned to be issued in 2016:

  • Clause 8.2.4: Description of identified [security] threats for determination of requirements for additional risk reduction. There shall also be a description of measures taken to reduce or remove the hazards.
  • Clause 11.2.12: The SIS design shall provide the necessary resilience against the identified security risks.

What does this mean for asset owners? It obviously makes it a requirement to perform a cyber security risk assessment for the safety instrumented systems (SIS). Such information asset risk assessments should, of course, be performed in any case for automation and safety systems. This, however, makes it necessary to keep security under control in order to obtain compliance with IEC 61511 – something that is often overlooked today, as described in this previous post. Further, when performing a security study, it is important that human and organizational factors are also taken into account – a good technical perimeter defense does not help if the users are not up to the task and lack awareness of the security problem.

With respect to the organizational context, the new Clause 11.2.12 is particularly interesting, as it will require security awareness and organizational resilience planning to be integrated into functional safety management planning. As noted by many others, we have seen a sharp rise in attacks on SCADA systems over the past few years – these security requirements will bring the reliability and security fields together and ensure better overall risk management for important industrial assets. These benefits, however, will only be achieved if practitioners take the full weight of the new requirements on board.

Gas station’s tank monitoring systems open to cyber attacks

Darkreading.com brought news about a project to set up a free honeypot tool for monitoring attacks against gas tank monitoring systems. Researchers have found attacks against such systems at several locations in the United States (read about it @darkreading). Interestingly, many of these systems for monitoring tank levels and the like are internet facing with no protection whatsoever – not even passwords. Attacks have so far only been of the vandalism type – changing a product’s name and similar pranks; no sophisticated attacks have been observed.

If we dwell on this situation a bit, we have to consider who would be interested in attacking gas station chains at the SCADA level. Obviously, if you can somehow halt the operation of all gas stations in a country, you limit people’s mobility. In addition, you obviously harm the gas stations’ business. Two of the most obvious attack motivations may thus be “sabotage against the nation as a whole” as part of a larger campaign, and pure criminal activity, for example using ransomware to halt gasoline sales until a ransom is paid. The latter is perhaps the more likely of the two threats.

So – what should the gas stations do? Obviously, some technical barriers are missing when a system is completely open and facing the internet. The immediate solution would be to protect all network traffic with VPN tunneling and to require a password for accessing the SCADA interfaces. Hopefully this will be done soon. The worrying aspect is that gas stations are not the only type of installation with very weak security – there are many potential targets for black hats that are very easy to reach. The more connected our world becomes through the integration of #IoT into our lives, the more important basic security measures become. Hopefully this will be realized not only by equipment vendors, but also by consumers.

The false sense of security people gain from firewalls

Firewalls are important for maintaining security. On that, I suppose almost all of us agree. A firewall is, however, not the final solution to the cyber security problem. First, there is the chance of bad guys pushing malware over traffic that is actually allowed through the firewall (people visiting bad web sites, for example). Then there is the chance that the firewall itself is set up the wrong way. Finally, there is the possibility that people bring their horrible stuff inside the walled garden using USB sticks, their own devices hooked up to the network, or similar. People running both IT and automation systems tend to be aware of all of these issues – and probably most users too. On the other hand, maybe not – but they should be aware of them and avoid doing obviously stupid stuff.

Then there is the threat of the social engineer. For a skilled con artist it is easy to trick almost anyone by bribing them, using temptations (drugs, sex, money, fame, prestige, power, etc.) or blackmailing them into helping an evil outsider. For some reason, companies tend to overlook this very human weakness in the defense layers. You normally do not find much mention of social engineering in operating policies, training and risk assessments for corporations running production critical IT systems, such as industrial control systems. Recent studies have shown that as many as 25% of people receiving phishing e-mails actually click on links to websites with malware downloads. Tricksters are becoming more skilled – and the language in phishing e-mails has improved tremendously since the Viagra spam period of ten years ago. This can be summarized in a “tricking the dog” drawing:


Stuff that makes organizations easier to penetrate using social engineering includes:

  • Low employee loyalty due to underpay, bad working environment and psychotic bosses
  • Stressed employees and organizations in a state of constant overload
  • Lack of understanding of the production processes and what is critical
  • Insufficient confidentiality about IT infrastructure – allowing the system to be analyzed from the outside
  • Lack of active follow-up of policies and practices such that security awareness erodes over time

Despite this being well known, few organizations actually do something about it. The best defense against the social engineering attack vector may very well be a security awareness focus in the organization, combined with efforts to create a good working environment and happy employees. That should be a win-win for both employees and the employer.

The good and bad sides of proof testing

Testing is an integral part of the operation and maintenance of equipment with a SIL rating. Testing is necessary to ensure that the achieved integrity of the safety instrumented function is actually as intended. First of all, the assumed test interval (hours between each proof test, τ) has a direct impact on the calculated probability of failure on demand (PFD) for a given rate of dangerous undetected failures, λDU:

PFD =  λDU x (τ/2)

This formula gives the average PFD for a single component. For redundant configurations things become more complicated, but let us stick to this one for the sake of simplicity. Obviously, if we cut the number of hours between tests in half, we cut the PFD in half. So, the more often we test, the better it is – right? No – wrong!

Why is that wrong? Testing has two negative sides:

  1. It stops production, which means it stops cash flow, which means it costs money there and then
  2. It is a source of errors itself, either through increased wear on the system or, more likely, through human error such as forgetting to put a system back into automatic mode after testing is done

Of course, this does not mean that we should not test – testing is absolutely necessary to make sure the safety function works. Also, over time we can use the results from testing of the functions in our SIS to check whether the assumed failure rates are correct. What it means is that we need to find the right balance between the good and bad sides of testing. In practice, annual testing is often used, and this may well be a sweet spot for test intervals. Sometimes engineers are tempted to increase the test frequency to avoid trouble with PFD numbers after they have bought inferior equipment. People working on the installations tend to strongly oppose this – and rightly so. Buy good components, and test at a reasonable frequency to minimize the impact of the bad sides of testing.
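To make the trade-off concrete, here is a minimal Python sketch of how the average PFD of a single component scales with the proof test interval; the failure rate used is an arbitrary assumption for illustration:

  # Average PFD of a single component: PFD = lambda_DU * tau / 2
  lambda_du = 2e-6   # assumed dangerous undetected failure rate per hour (illustrative)

  def pfd_avg(tau_hours):
      return lambda_du * tau_hours / 2

  for months in (3, 6, 12, 24):
      tau = months * 730   # roughly 730 hours per month
      print(months, "month interval ->", round(pfd_avg(tau), 5))
  # Halving the interval halves the PFD, but the formula says nothing about lost
  # production or the human errors each extra test can introduce.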

Operating systems and safety critical software – does the OS matter?

Safety critical software must be developed in accordance with certain practices, using tools fit for the required level of reliability, and by an organization with the right competence and maturity for developing such software. These software components are often parts of barrier management systems; bugs may lead to the removal of critical functionality, which again can lead to an accident with physical consequences such as death, injuries, release of pollutants to the environment and severe material damage.

It is therefore a key question whether such software should be allowed to run under consumer grade operating systems that do not conform to any reliability-oriented development practices. The primary argument from some vendors for why this should be allowed is “proven in use”: they have used the software under said operating system for so many operating hours without incident that they consider the system safe, as borne out by experience.

It immediately seems reasonable to put more trust in a set-up that has been field tested over time and has shown the expected performance than in a system that has not been tested in the field. Most operating systems are shipped with known bugs in addition to unknown bugs, and a list of bugs will exist for patching. A prioritization of criticality is made, and the patches are developed accordingly. For Linux systems this patching strategy may be somewhat less organized, as development is more distributed and less managed, even at the kernel level. The problem is akin to the classical software security problem: if software with design flaws and bugs is released, such flaws are found either when a vulnerability is discovered by externals or when an incident occurs that exposes the flaw. The bug or flaw is always inherent in the code, and typically stems from a lack of good practices during design and code development. In theory, damage resulting from such bugs and flaws shall then be limited by patching the system. In the meantime it is assumed that perimeter defenses can counteract the risk of a vulnerability being exploited (an argument that may not even hold in the security setting). For bugs affecting the safety of the underlying system, this thinking is flawed, because even a single accident may have unacceptable consequences – including loss of human life.

In reliability engineering it is disputed whether a “workflow oriented” or a “reliability growth oriented” view of software development and reliability is the most fruitful. Both have their merits. The “ship and patch” thinking inherent in proven-in-use arguments for software indicates a stronger belief in reliability growth concepts. These are models that try to link the number of faults per operating time of the software to the duration of the discovery period; most of them are modeled as some form of Poisson process. It should be acknowledged that this is a probabilistic model of deterministic errors inherent in the software, where the stochastic element is whether the software states realizing the errors are actually visited during execution.
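As an illustration of what such a model looks like, here is a Python sketch of a reliability growth model of the Goel-Okumoto type, a non-homogeneous Poisson process where the expected number of faults found by time t is a(1 - e^(-bt)). The parameters a and b are invented for illustration; in practice they would be estimated from observed failure data:

  import math

  # Goel-Okumoto reliability growth model: m(t) = a * (1 - exp(-b * t))
  a = 100.0   # assumed total number of faults in the software
  b = 0.001   # assumed fault detection rate per operating hour

  def expected_faults_found(t_hours):
      return a * (1.0 - math.exp(-b * t_hours))

  for t in (1000, 5000, 20000):
      found = expected_faults_found(t)
      print(t, "hours:", round(found, 1), "faults found,", round(a - found, 1), "expected to remain")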

Coming back to operating systems, we see that the complexity of such systems has grown very rapidly over the last decades. Looking specifically at the Linux kernel, the development has been tracked over time. The first kernel had about 10,000 lines of code (in 1991). During the development of kernel version 2.6.29, almost 10,000 lines of code were added per day. If a reliability growth concept is going to work for such rapid growth in complexity, 10,000 lines of code must be analyzed daily and end up completely bug-free – and to prove that, it would be necessary to test every software state for those 10,000 lines of code.

Some research exists to compare effect of coding practices. Microsoft stated in 1992 that they had about 10-20 errors per 1000 lines of code prior to testing and that about 0.5 errors per 1000 lines of code would exist in shipped products.

Compliant development gives no guarantee of flaw- and bug-free software. The same goes for development following good security practices – vulnerabilities may still exist. These practices have, however, been developed to minimize the number of design flaws and bugs getting into the shipped product. Structured programming techniques have been shown to produce code with fewer than 0.1 defects per 1000 lines of code – basically by following a workflow oriented quality regime in tandem with testing. If we assume 0.5 errors per 1000 lines of code in the roughly 15 million lines of Linux kernel 3.2 (and the kernel is not the entire OS), we have an estimated 7500 undiscovered bugs in the shipped version of that kernel.
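The kernel estimate is simply defect density multiplied by code size. A rough Python sketch of the arithmetic, where the line count for kernel 3.2 is an approximation and the densities are the figures quoted above:

  # Residual bug estimate = defect density * lines of code
  kernel_loc = 15_000_000   # approximate size of Linux kernel 3.2 (assumption)

  for defects_per_kloc in (0.5, 0.1):
      residual = kernel_loc / 1000 * defects_per_kloc
      print(defects_per_kloc, "defects/KLOC ->", int(residual), "estimated residual bugs")
  # 0.5 defects/KLOC -> 7500 bugs; 0.1 defects/KLOC (structured techniques) -> 1500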

An international rating for the security of operating systems exists: the EAL rating. Commercial grade systems typically have a rating of EAL 4, whereas secure RTOSs tend to be EAL 5 (semiformally designed and tested).

The summary seems to be that using consumer grade OSs for life critical automation systems is not the best of ideas – which is why we don’t see too many of them.

What is the difference between software and hardware failures in a reliability context?

Reliability engineers have traditionally focused more on hardware than software. There are many reasons for this; one is that safety systems have traditionally been based on analog electronics, and although digital controls and PLCs were introduced throughout the 1990s, the software involved was initially very simple. Today the situation has really changed, but reliability practice has not completely taken this on board. One of the reasons may be that reliability experts like to calculate probabilities – which they are very good at doing for hardware failures. Hardware failures tend to be random and can be modeled quite well using probabilistic tools. So, what about software? The failure mechanisms are very different: failures in hardware are related to more or less stochastic effects stemming from load cycling, material defects and ageing, whereas software defects are completely deterministic (we disregard stochastic algorithms here – they are banned from use in safety critical control systems anyway).

Software defects exist for two reasons: design errors (flaws) and implementation errors (bugs). These errors may be introduced at the requirements stage or during actual coding, but irrespective of when they are introduced, they are static. They do not suddenly occur – they are latent errors hidden within the code that will activate each and every time the software state where the error is relevant is visited.

Such errors are very difficult to include in a probabilistic model. That is why reliability standards prescribe a completely different medicine: a process oriented framework that places requirements on management, the choice of methods and tools, as well as testing and documentation. These quality directed workflows and requirements are put in place so that we can have some confidence that the software is not a significant source of unsafe failures of the critical control system.

Hence – process verification and auditing take the place of probability calculations when we look at the software. In order to achieve the desired level of trust, it is very important that these practices are not neglected in the functional safety work. Deterministic errors may be just as catastrophic as random ones – and therefore they must be managed with just as much rigor and care. The current trend is that more and more functionality is moved from hardware to software – which means that software errors are becoming increasingly important to manage correctly if we are not going to degrade both the performance of and the trust in the safety instrumented systems we rely on to protect our lives, assets and the environment.

Does safety engineering require security engineering?

Safety critical control systems are developed with respect to reliability requirements, often following a reliability standard such as IEC 61508 or CENELEC EN 50128. These standards put requirements on development practices and activities with regard to creating software that works the way it is intended based on the expected input, and where availability and integrity are of paramount importance. However, these standards do not address information security. Some of the practices required by reliability standards do help in removing bugs and design flaws – which to a large extent also removes security vulnerabilities – but they do not explicitly express such concerns. Reliability engineering is about building trust in the intended functionality of the system. Security is about the absence of unintended functionality.

Consider a typical safety critical system installed in an industrial process, such as an overpressure protection system. Such a system may consist of a pressure transmitter, a logic unit (i.e. a computer) and some final elements. This simple system measures the pressure and transmits it to the computer, typically over a hardwired analog connection. The computer then decides whether the system is within a safe operating region or above a set point for stopping operation. If we are in the unsafe region, the computer tells the final element to trip the process, for example by flipping an electrical circuit breaker or closing a valve. Reliability standards that include software development requirements focus on how development should work in order to ensure that whenever the sensor transmits a pressure above the threshold, the computer will tell the process to stop. Further, the computer is connected over a network to an engineering station, which is used for such things as updating the algorithm in the control system, changing the threshold limits, etc.
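As a deliberately simplified Python sketch of the trip logic described above – the names, setpoint and I/O stubs are invented for illustration, not taken from any real system:

  # Simplified overpressure trip logic – illustrative only
  TRIP_SETPOINT_BARG = 50.0   # assumed high-pressure trip limit

  def should_trip(pressure_barg):
      """Return True if the measured pressure is at or above the trip setpoint."""
      return pressure_barg >= TRIP_SETPOINT_BARG

  def control_cycle(read_pressure, trip_final_element):
      pressure = read_pressure()        # e.g. a 4-20 mA signal scaled to barg
      if should_trip(pressure):
          trip_final_element()          # open the breaker / close the valve

  # Example with stubbed I/O:
  control_cycle(lambda: 53.2, lambda: print("TRIP: isolating the process"))

The logic itself is trivial; the point is that nothing in this code, or in the reliability standard it might comply with, says anything about who is allowed to change the setpoint from the engineering station.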

What if someone wants to put the system out of order without anyone noticing? The software’s access control would be a crucial barrier against anyone tampering with the functionality. Reliability standards do not talk about how to actually avoid weak authentication schemes, although they talk about access management in general. You may very well be compliant with the reliability standard – yet have very weak protection against compromise of the access control. For example, the coder may very well use a “getuser()” call in C in the authentication part of the software without violating the reliability standard requirements. This is a very insecure way of getting user credentials and should generally be avoided. If such a practice is used, a hacker with access to the network could with relative ease gain admin access to the system and change, for example, set points – or worse, recalibrate the pressure sensor to report wrong readings, something that was actually done in the Stuxnet case.

In other words – as long as someone might be interested in harming your operation, your safety system needs security built in, and that does not come for free through reliability engineering. And there is always someone out to get you – for sport, for money, or just because they do not like you. Managing security is an important part of managing your business risk – so do not neglect this issue while worrying only about the reliability of intended functionality.