Electrical isolation of ignition sources on offshore installations

One of the typical major accident scenarios considered when building and operating an offshore drilling or production rig is a gas leak that is ignited, leading to a jet fire or, even worse, an explosion. For this scenario to happen we need three things (from the fire triangle):

  • Flammable material (the gas)
  • Oxygen (air)
  • An ignition source

The primary protection against such accidents is containment of flammable materials; avoiding leaks is the top safety priority offshore. As much of this equipment is located outdoors (or in “naturally ventilated areas”, as standards tend to call them), removing “air” is not an option. Removing the ignition source therefore becomes very important in the event of a gas leak. The technical system used to achieve this consists of a large number of gas detectors distributed across the installation, sending messages of detected gas to a controller, which then sends a signal to shut down all potential ignition sources (i.e. non-Ex certified equipment; see the ATEX directive for details).

This being the barrier between “not much happening” and “a major disaster”, the reliability of this ignition source control is very important. Ignition sources are normally electrical systems not specifically designed to avoid ignition (i.e. equipment that is not Ex certified). In order to achieve sufficient reliability of this set-up, the number of breakers that must open should be kept to a minimum; this means that the non-Ex equipment should be grouped in distribution boards such that an incomer breaker can be used to isolate the whole group, instead of isolating at the individual consumer level. This is much more reliable, as the probability of failure on demand (PFD) will contain an additive term for each of the breakers included:

PFD = PFD(Detector) + PFD(Logic) + Σ PFD(Breaker_i)

Consider a situation with 100 consumers, where the dangerous undetected failure rate for the breakers used is 10^-7 failures per hour of operation and proof testing is performed every 24 months. The contribution from a single breaker is then

PFD(Breaker) = 10^-7 × (8760 × 2) / 2 = 0.000876

If we then have 6 breakers that need to open for full isolation, the breakers contribute about 0.005 to the PFD (which means that, with reliable gas detectors and logic solver, a full loop can satisfy a SIL 2 requirement). If we have 100 breakers, the contribution to the PFD is almost 0.09 – and the best we can hope for is SIL 1.
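The arithmetic above can be reproduced in a short script; the failure rate and test interval are the assumed example values from the text, not data for any particular breaker:

```python
# PFD contribution from breakers in an ignition-source isolation group.
# Each breaker adds an additive term lambda_DU * tau / 2 to the loop PFD.

def breaker_pfd(lambda_du: float, tau_hours: float) -> float:
    """Average PFD of a single breaker: PFD = lambda_DU * tau / 2."""
    return lambda_du * tau_hours / 2

LAMBDA_DU = 1e-7      # assumed dangerous undetected failure rate [per hour]
TAU = 8760 * 2        # 24-month proof test interval in hours

single = breaker_pfd(LAMBDA_DU, TAU)
print(f"single breaker: {single:.6f}")        # 0.000876
print(f"6 breakers:     {6 * single:.4f}")    # 0.0053 -> SIL 2 still reachable
print(f"100 breakers:   {100 * single:.4f}")  # 0.0876 -> SIL 1 at best
```

The linear growth of the sum is exactly why grouping consumers behind a few incomer breakers pays off so quickly.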

Stupid resource planning in projects

 

Most projects are supposed to follow an S-curve: first a slow kick-off period where people familiarize themselves with the scope, then a production period with a high workload and steady progress, and finally a finalization period with a lower workload that still takes time because many interfaces are involved in QA prior to delivery. Real projects, however, often end up behind schedule, and project execution is intensified towards the end in order to regain the lost time. Resource availability, however, is typically not updated to the new reality, and hence there is a need for overtime. We may call the three phases of real projects:

  1. Familiarization (as planned)
  2. Waiting for input and getting delayed (not really as planned)
  3. Panic phase to delivery (really not as planned)

All of this is illustrated in the graph below.

 

What is the effect of this sort of bad management? Overworked people, loss of quality, increased cost and inevitable delays. We have seen this many times – why do we still keep doing it? The antidote is:

  1. Better follow-up of stakeholders from the beginning
  2. Allow some slack in the schedule to avoid cascading schedule impacts
  3. If adjustment of the plan is needed, do not forget to adjust the manning plan as well – excessive overtime produces bad output

 

When this happens in safety management projects, there is reason to get worried; we really do not want subpar performance in the project output. So, for the safety of our people, we owe it to them to manage project resources in a better way – don’t overlook the obvious.

New security requirements to safety instrumented systems in IEC 61511

IEC 61511 is undergoing revision, and one of the more welcome changes is the inclusion of cyber security clauses. According to a presentation given by functional safety expert Dr. Angela Summers at the Mary Kay Instrument Symposium in January 2015, the following clauses are now included in the new draft – the standard is planned for issue in 2016:

  • Clause 8.2.4: Description of identified [security] threats for determination of requirements for additional risk reduction. There shall also be a description of measures taken to reduce or remove the hazards.
  • Clause 11.2.12: The SIS design shall provide the necessary resilience against the identified security risks

What does this mean for asset owners? It obviously makes it a requirement to perform a cyber security risk assessment for safety instrumented systems (SIS). Such information asset risk assessments should, of course, be performed in any case for automation and safety systems. This, however, makes it necessary to keep security under control in order to obtain compliance with IEC 61511 – something that is often overlooked today, as described in this previous post. Further, when performing a security study, it is important that human and organizational factors are also taken into account – a good technical perimeter defense does not help if the users are not up to the task or lack awareness of the security problem.

With respect to organizational context, the new Clause 11.2.12 is particularly interesting, as it will require security awareness and organizational resilience planning to be integrated into functional safety management planning. As noted by many others, we have seen a sharp rise in attacks on SCADA systems over the past few years – these security requirements will bring the reliability and security fields together and ensure better overall risk management for important industrial assets. These benefits, however, will only be achieved if practitioners take the full weight of the new requirements on board.

Gas station’s tank monitoring systems open to cyber attacks

Darkreading.com brought news about a project to set up a free honeypot tool for monitoring attacks against gas tank monitoring systems. Researchers have found attacks against such systems at several locations in the United States (read about it @darkreading). Interestingly, many of these systems for monitoring tank levels etc. are internet facing with no protection whatsoever – not even passwords. Attacks have so far only been of the cyberpunk type – changing a product’s name and the like; no sophisticated attacks have been observed.

If we dwell on this situation a bit, we have to consider who would be interested in attacking gas station chains at the SCADA level. Obviously, if you can somehow halt the operation of all gas stations in a country, you limit people’s mobility. In addition, you obviously harm the gas station’s business. Two of the most obvious attack motivations may thus be “sabotage against the nation as a whole” as part of a larger campaign, and pure criminal activity, for example using ransomware to halt gasoline sales until a ransom is paid. The latter is perhaps the more likely of the two threats.

So – what should the gas stations do? Obviously, there are some technical barriers missing here when the system is completely open and facing the internet. The immediate solution would be to protect all network traffic by VPN tunneling, and to require a password for accessing the SCADA interfaces. Hopefully this will be done soon. The worrying aspect of this is that gas stations are not the only installation type with very weak security – there are many potential targets for black hats that are very easy to reach. The more connected our world becomes through integration of #IoT into our lives – the more important basic security measures become. Hopefully this will be realized not only by equipment vendors, but also by consumers.

The false sense of security people gain from firewalls

Firewalls are important for maintaining security. On that, I suppose almost all of us agree. They are, however, not the final solution to the cyber security problem. First, there is the chance of bad guys pushing malware over traffic that is actually allowed through the firewall (people visiting bad web sites, for example). Then there is the chance that the firewall itself is set up in the wrong way. Finally, there is the possibility that people bring their horrible stuff inside the walled garden by using USB sticks, hooking their own devices up to the network, or similar. People running IT and automation systems tend to be aware of all of these issues – and probably most users are too. Then again, maybe not – but they should be, and they should avoid doing obviously stupid stuff.

Then there is the threat of the social engineer. For a skilled con artist it is easy to trick almost anyone by bribing them, using temptations (drugs, sex, money, fame, prestige, power, etc.) or blackmailing them into helping an evil outsider. For some reason, companies tend to overlook this very human weakness in their defense layers. You normally do not find much mention of social engineering in operating policies, training and risk assessments at corporations running production critical IT systems, such as industrial control systems. Recent studies have shown that as many as 25% of people receiving phishing e-mails actually click on links to websites with malware downloads. Tricksters are becoming more skilled – and the language in phishing e-mails has improved tremendously since the Viagra spam period of ten years ago. This can be summarized in a “tricking the dog” drawing:


Stuff that makes organizations easier to penetrate using social engineering includes:

  • Low employee loyalty due to underpay, bad working environment and psychotic bosses
  • Stressed employees and organizations in a state of constant overload
  • Lack of understanding of the production processes and what is critical
  • Insufficient confidentiality about IT infrastructure – allowing systems to be analyzed from the outside
  • Lack of active follow-up of policies and practices such that security awareness erodes over time

Although this is well known, few organizations actually do anything about it. The best defense against the social engineering attack vector may very well be a security awareness focus in the organization, combined with efforts to create a good working environment and happy employees. That should be a win-win for both employees and the employer.

The good and bad sides of proof testing

Testing is an integral part of the operation and maintenance of equipment with a SIL rating. Testing is necessary to ensure that the achieved integrity of the safety instrumented function is actually as intended. First of all, the assumed test interval (hours between each proof test, τ) has a direct impact on the calculated probability of failure on demand (PFD) for a given failure rate for dangerous undetected failures, λ_DU:

PFD = λ_DU × τ/2

This formula gives the average PFD for a single component. For redundant configurations things become more complicated, but let us stick to this one for the sake of simplicity. Obviously, if we cut the number of hours between tests in half, we cut the PFD in half. So, the more often we test, the better – right? No – wrong!
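As a numerical sketch of how the test interval drives the numbers – including, for contrast, the commonly quoted simplified formula for a redundant 1oo2 pair, PFD ≈ (λ_DU·τ)²/3, which ignores common cause failures and diagnostics – with an illustrative failure rate, not data for any real component:

```python
# Average PFD for a single component (1oo1) and, for contrast, a
# simplified 1oo2 redundant pair (no common cause, no diagnostics).

def pfd_1oo1(lambda_du: float, tau: float) -> float:
    return lambda_du * tau / 2

def pfd_1oo2(lambda_du: float, tau: float) -> float:
    # Simplified textbook formula: ((lambda_DU * tau) ** 2) / 3
    return (lambda_du * tau) ** 2 / 3

LAMBDA = 1e-6  # illustrative dangerous undetected failure rate [per hour]
for months in (6, 12, 24):
    tau = months * 730  # approximate hours per month
    print(f"{months:>2} months: 1oo1={pfd_1oo1(LAMBDA, tau):.2e}, "
          f"1oo2={pfd_1oo2(LAMBDA, tau):.2e}")
```

Note that halving the interval halves the 1oo1 PFD but quarters the 1oo2 PFD – which is one reason the cost-benefit of more frequent testing depends on the architecture.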

Why is that wrong? Testing has two negative sides:

  1. It stops production, which means it stops cash flow, which means it costs money there and then
  2. It is a source of errors itself, either through increased wear on the system or, more likely, through human error such as forgetting to put a system back into automatic mode after testing is done

Of course, this does not mean that we should not test – testing is absolutely necessary to make sure the safety function works. Also, over time we can use the results from testing of our functions in the SIS to check whether the assumed failure rates are correct. What it means is that we need to find the right balance between the good and bad sides of testing. In practice, annual testing is often used, and this may well be a sweet spot for the test interval. Sometimes engineers are tempted to increase the test frequency to avoid trouble with PFD numbers after they have bought inferior equipment. People working on the installations tend to strongly oppose this – and rightly so. Buy good components, and test with a reasonable frequency to minimize the impact of the bad sides of testing.

Operating systems and safety critical software – does the OS matter?

Safety critical software must be developed in accordance with certain practices, using specific tools fit for the required level of reliability, and by an organization with the right competence and maturity for developing such software. These software components are often parts of barrier management systems; bugs may lead to removal of critical functionality, which in turn can lead to an accident with physical consequences such as death, injuries, release of pollutants to the environment and severe material damage.

It is therefore a key question whether such software should be allowed to run under consumer grade operating systems that do not conform to any reliability development practices. The primary argument from some vendors for why this should be allowed is “proven in use”: they have used the software under said operating system for so many operating hours without incident that they consider the system safe, as borne out by experience.

It immediately seems reasonable to put more trust in a system set-up that has been field tested over time and has shown the expected performance than in a system that has not been tested in the field. Most operating systems are issued with known bugs in addition to unknown ones, and a list of bugs will exist for patching. A prioritization of criticality is made, and the patches are developed accordingly. For Linux systems this patching strategy may be somewhat less organized, as development is more distributed and less managed, even at the kernel level. The problem is akin to the classical software security problem: if software with design flaws and bugs is released, such flaws will be found when a vulnerability is discovered by externals, or when an incident occurs that exposes the flaw. The bug or flaw is always inherent in the code, and typically stems from a lack of good practices during design and development. In theory, damage resulting from such bugs and flaws shall then be limited by patching the system. In the meantime, it is thought that perimeter defences can counteract the risk of a vulnerability being exploited (an argument that may not even hold in the security setting). For bugs affecting safety in the underlying system, this thinking is flawed, because even a single accident may have unacceptable consequences – including loss of human life.

In reliability engineering it is disputed whether a “workflow oriented” or a “reliability growth oriented” view of software development and reliability is the most fruitful. Both have their merits. The “ship and patch” thinking inherent in proven-in-use cases for software indicates a stronger belief in reliability growth concepts. These are models that try to link the number of faults per operating time of the software to the duration of the discovery period; most of them are modeled as some form of Poisson process. It is acknowledged that this is a probabilistic model of deterministic errors inherent in the software; the stochastic element is whether the actual software states realizing the errors are visited during execution.
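To make the reliability growth idea concrete, here is a minimal sketch of one such Poisson-process-based model, the Goel-Okumoto model; the parameter values below are purely illustrative, not fitted to any real data:

```python
import math

def goel_okumoto_mean(t: float, a: float, b: float) -> float:
    """Expected cumulative number of faults detected by time t.

    a: total expected number of latent faults in the software
    b: per-fault detection rate [per hour]
    Mean value function of the Goel-Okumoto NHPP: mu(t) = a * (1 - e^(-b*t)).
    """
    return a * (1.0 - math.exp(-b * t))

def expected_remaining(t: float, a: float, b: float) -> float:
    """Expected number of faults still undiscovered at time t."""
    return a - goel_okumoto_mean(t, a, b)

# Illustrative parameters: 100 latent faults, detection rate 0.01 per hour
a, b = 100.0, 0.01
for t in (0, 100, 500, 1000):
    print(f"t={t:>4} h: remaining faults ~ {expected_remaining(t, a, b):.1f}")
```

The model predicts a decaying fault discovery rate – the "growth" in reliability – but note that it says nothing about whether the remaining faults are the catastrophic ones, which is the core objection raised above.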

Coming back to operating systems, we see that the complexity of such systems has grown very rapidly over the last decades. Looking specifically at the Linux kernel, the development has been tracked over time. The first kernel had about 10,000 lines of code (in 1991). During the development of kernel version 2.6.29, almost 10,000 lines of code were added per day. If a reliability growth concept is going to work for such rapid growth in complexity, 10,000 lines of code must be analyzed daily and end up completely bug-free – and proving that would require testing every software state those 10,000 lines can produce.

Some research exists comparing the effects of coding practices. Microsoft stated in 1992 that they had about 10-20 errors per 1000 lines of code prior to testing, and that about 0.5 errors per 1000 lines of code would remain in shipped products.

Compliant development gives no guarantee of flaw- and bug-free software. The same goes for development following good security practices – vulnerabilities may still exist. These practices have, however, been developed to minimize the number of design flaws and bugs getting into the shipped product. Structured programming techniques have been shown to produce code with fewer than 0.1 defects per 1000 lines of code – basically by following a workflow oriented quality regime in tandem with testing. If we assume 0.5 errors per 1000 lines of code in the Linux kernel (the kernel is not the entire OS), we get an estimated 7500 undiscovered bugs in the shipped version of Linux kernel 3.2.
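That back-of-the-envelope estimate is easy to reproduce; the line count below is the approximate size assumed for kernel 3.2, and the defect densities are the figures quoted above:

```python
# Rough residual-defect estimate: defects = (lines / 1000) * density.
KERNEL_3_2_LOC = 15_000_000   # approximate line count assumed for kernel 3.2
SHIPPED_DENSITY = 0.5         # defects per 1000 lines (shipped-product figure)
STRUCTURED_DENSITY = 0.1      # defects per 1000 lines (structured techniques)

def residual_defects(loc: int, density_per_kloc: float) -> float:
    return loc / 1000 * density_per_kloc

print(residual_defects(KERNEL_3_2_LOC, SHIPPED_DENSITY))     # 7500.0
print(residual_defects(KERNEL_3_2_LOC, STRUCTURED_DENSITY))  # 1500.0
```

Even the optimistic structured-programming density leaves a four-digit number of latent defects at this code size – which is the scale problem the argument hinges on.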

An international rating for the security of operating systems exists: the EAL rating from the Common Criteria. Commercial grade systems have a rating of EAL 4, whereas secure RTOSs tend to be EAL 5 (semiformally designed, verified and tested).

The summary seems to be that consumer grade OSs for life critical automation systems are not the best of ideas – which is why we don’t see too many of them.

What is the difference between software and hardware failures in a reliability context?

Reliability engineers have traditionally focused more on hardware than on software. There are many reasons for this; one is that safety systems have traditionally been based on analog electronics, and although digital controls and PLCs were introduced throughout the 1990s, the software involved was initially very simple. Today the situation has really changed, but reliability practice has not completely taken this on board. One of the reasons may be that reliability experts like to calculate probabilities – which they are very good at doing for hardware failures. Hardware failures tend to be random and can be modeled quite well using probabilistic tools. So – what about software? The failure mechanisms are very different: failures in hardware are related to more or less stochastic effects stemming from load cycling, material defects and ageing, whereas software defects are completely deterministic (we disregard stochastic algorithms here – they are banned from use in safety critical control systems anyway).

Software defects exist for two reasons: design errors (flaws) and implementation errors (bugs). These errors may occur at the requirement stage or during actual coding, but irrespective of when they occur, they are always static. They do not suddenly appear – they are latent errors hidden within the code – and they will activate each and every time the software state where the error is relevant is visited.

Such errors are very difficult to include in a probabilistic model. That is why reliability standards prescribe a completely different medicine: a process oriented framework with requirements on management, choice of methods and tools, as well as testing and documentation. These quality directed workflows and requirements are put in place so that we can have some confidence in the software not being a significant source of unsafe failures of the critical control system.

Hence, process verification and auditing take the place of probability calculations when we look at the software. In order to achieve the desired level of trust, it is very important that these practices are not neglected in the functional safety work. Deterministic errors may be just as catastrophic as random ones – and therefore they must be managed with just as much rigor and care. The current trend is that more and more functionality is moved from hardware to software – which means that software errors are becoming increasingly important to manage correctly if we are not going to degrade both the performance of, and our trust in, the safety instrumented systems we rely on to protect our lives, assets and the environment.

Does safety engineering require security engineering?

Safety critical control systems are developed with respect to reliability requirements, often following a reliability standard such as IEC 61508 or CENELEC EN 50128. These standards put requirements on development practices and activities with regard to creating software that works the way it is intended based on the expected input, and where availability and integrity are of paramount importance. However, these standards do not address information security. Some of the practices required by reliability standards do help in removing bugs and design flaws – which to a large extent also removes security vulnerabilities – but they do not explicitly express such concerns. Reliability engineering is about building trust in the intended functionality of the system. Security is about the absence of unintended functionality.

Consider a typical safety critical system installed in an industrial process, such as an overpressure protection system. Such a system may consist of a pressure transmitter, a logic unit (i.e. a computer) and some final elements. This simple system measures the pressure and transmits it to the computer, typically over a hardwired analog connection. The computer then decides whether the system is within a safe operating region or above a set point for stopping operation. If we are in the unsafe region, the computer tells the final element to trip the process, for example by flipping an electrical circuit breaker or closing a valve. Reliability standards that include software development requirements focus on how development must work in order to ensure that whenever the sensor reports pressure above the threshold, the computer will tell the process to stop. Further, the computer is connected over a network to an engineering station, which is used for such things as updating the algorithm in the control system, changing the threshold limits, etc.

What if someone wants to put the system out of order without anyone noticing? The software’s access control would be a crucial barrier against anyone tampering with the functionality. Reliability standards do not talk about how to actually avoid weak authentication schemes, although they talk about access management in general. You may very well be compliant with the reliability standard – yet have very weak protection against compromise of the access control. For example, the coder may use a “getuser()” call in C in the authentication part of the software without violating the reliability standard’s requirements. This is a very insecure way of getting user credentials from the computer and should generally be avoided. If such a practice is used, a hacker with access to the network could with relative ease get admin access to the system and change, for example, set points – or worse, recalibrate the pressure sensor to report wrong readings – something that was actually done in the Stuxnet case.
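Python's getpass.getuser() illustrates the same class of weakness as the getuser() call mentioned above: it derives identity from the login environment rather than authenticating the caller, so anyone who controls the process environment controls the reported identity (a generic illustration, not a claim about any specific SIS product):

```python
import getpass
import os

# getpass.getuser() simply returns the first non-empty value among the
# LOGNAME, USER, LNAME and USERNAME environment variables -- no password,
# no authentication. An attacker who controls the environment therefore
# controls the "identity" the program sees.
for var in ("LOGNAME", "USER", "LNAME", "USERNAME"):
    os.environ[var] = "admin"

print(getpass.getuser())  # prints "admin"
```

Any authentication decision built on such environment-derived identity can be bypassed without touching the safety logic at all – which is exactly why access control needs explicit security requirements, not just reliability requirements.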

In other words – as long as someone might be interested in harming your operation, your safety system needs security built in, and that does not come for free through reliability engineering. And there is always someone out to get you – for sport, for money, or just because they do not like you. Managing security is an important part of managing your business risk – so do not neglect this issue while worrying only about the reliability of intended functionality.

Obtaining the necessary documentation from vendors – why is it so hard?

When buying technical items that carry specific requirements – such as reliability or SIL requirements – obtaining documentation showing that the requirements are met can be a headache. This is especially true in SIL projects, in particular if a vendor is not used to providing the correct documentation. When this is the case, what can be done to make sure we do not get delays and problems with non-conformity? The answer is as logical and straightforward to state as it is difficult to implement. Experience has shown that some practices make obtaining compliance documentation easier – and it is all about communication.

First of all, vendors must be made aware of the requirements at the time they are bidding for the sale – and not only through a reference to a standard, but with an actual explanation of what it means and what is expected. The party selling something should really try to understand this by asking the right questions, but they don’t always do that. So, communicating with the vendor from early on is important. This can be done by including the requirements in the purchase order or contract.

This is, of course, not enough. Vendor follow-up should be part of the planning of the engineering activities – just like you would plan to spend resources on requirement setting or participating in FATs!

When this has been planned, it is time to step up and help the vendor. Provide a guideline describing what documentation they need to deliver. Provide training to help them understand what is being asked of them, if they do not have a sufficient level of understanding. Make sure the engineer responsible for following up the delivery also knows these requirements. Then the two contacts can speak the same language.

The vendor needs to know what the actual requirements are. It is therefore good practice to develop and supply the safety requirement specification as early as possible. Of course – that may lead to later changes as more information becomes available, but as long as everyone is on board with that and changes are properly managed, including across interfaces, this is a much better situation than requirements coming to vendors too late.

When the vendor has access to the SRS, the follow-up process should be intensified. Ask regular questions about progress – this may be done in an informal manner if the business climate is right. Open up for questions and be a support to the vendor in this process. Keep the conversation going, and make sure you get the progress you need. Tools for “expediting” progress should first and foremost be on the carrot side – let the vendor see real business benefit and value in providing good documentation and fulfilling expectations. Carrots can be things such as a better chance of repeat business, improved vendor competence and standing in the market, and possibly the option to become a shortlisted or preferred supplier. In addition to a bag of carrots, it may be necessary to carry a stick for when soft talking no longer works. Metaphorical sticks may be things like payments tied to documentation deliverables, banning the supplier from future projects, and general reputational damage. Hopefully you can keep the sticks hidden in the golf bag.

This provides no guarantee – but thinking about this stuff is infinitely better than doing nothing. My experience is that a little help brings a lot of progress.

To package these thoughts into a more condensed form you may refer to the following infographic – which sums up what to focus on during planning, design and follow-up.

Infographic showing how to follow up with vendors to obtain the necessary documentation on SIL compliance.