SIL and ballast systems

Working on floating oil and gas facilities, one question keeps popping up about ballast systems: should they have SIL requirements, and if so, what should those requirements be? When seeking to establish requirements for such systems, several issues are uncovered. First of all, current designs of ballast systems are very robust, due to the evolution of designs and requirements in shipping over a long time. Further, the problem is much more complex than collecting a few well-defined failure modes with random error data leading to a given situation, as typically seen in many process industry type problem descriptions. This complexity depends on a number of factors, and some of them are specific to each ship or installation, such as location, ship traffic density or the operating practices of personnel onboard. Therefore, any quantitative estimates of “error probabilities” contributing to an expected return frequency of critical events concerning the system will have significant uncertainties associated with them.


A ballast system is used to maintain the stability of a ship or a floating hull structure under varying cargo loading conditions, and in various sea conditions and ship drafts. Water is kept in tanks dispersed around the hull structure, and can be pumped in or out, or transferred between tanks, to maintain stability. Errors in ballasting operations can lead to loss of stability, which in the worst case means a sunken ship. The ballasting operation is normally a semi-manual operation where a marine operator uses a loading computer to guide decisions about ballasting, and manually gives commands to a computer-based control system to transfer water into or out of a particular ballast tank. Because this is such a critical safety system, it is a natural question to ask: “what are the performance requirements?”.

Ballast systems have been part of shipping for hundreds of years. Requirements for ballast systems are thus set in the classification rules of ship classification societies, such as Lloyd’s Register, DNV GL or ABS. These requirements are typically prescriptive in nature and focus on robustness and avoidance of common cause failures in the technology. Maritime classification societies do not refer to safety integrity levels, but rely on other means of ensuring safe operation and reliability. Society has accepted this practice for years, for very diverse vessels ranging from oil tankers to passenger cruise ships.

In oil and gas operations, the use of safety integrity levels to establish performance requirements for instrumented safety functions is the norm, and standards such as IEC 61508 are used as the point of reference. The Norwegian Oil and Gas Association has made a guideline that is normally applied for installations in Norwegian waters, which offers a simplified approach to setting requirements based on “typical performance”. This guideline can be downloaded freely from the association’s web pages. It states that for “start of ballasting for rig re-establishment”, the system should conform to a SIL 1 requirement. The “system” is described as consisting of a ballast control node, 2 x 100% pumps and three ballast valves. In Appendix A.12 of the guideline, a description of this “sub-function” is given, together with a calculation of achievable performance.

It may be argued that this functional description is somewhat artificial, because the ballast system on a production installation is normally operated more or less continuously. The function is defined for a single ballast tank/compartment, irrespective of the number of tanks and the load balancing necessary for re-establishing stability. The Guideline 070 approach is based on “typical performance” of the safety system as it is defined, and is not linked directly to the risk reduction required from the system. Multiple approaches may be taken to assign safety integrity levels based on risk analysis, see for example IEC 61508. One such method that is particularly common in the process industries and the oil and gas industry is “layers of protection analysis”, or LOPA for short. In this type of study, multiple initiating events can contribute to one hazard situation, for example “sunken ship due to loss of stability”. Multiple barriers or “independent protection layers” can be credited for reducing the risk of this hazard being realized.

In order to use a risk-based method for setting the integrity requirement, it is necessary to define what is an acceptable frequency of this event occurring. Let us say, for the sake of the discussion, that it is acceptable that the mean time between each “sunken ship due to loss of stability” event is 1 million years. How can we reason about this to establish requirements for the ballast system? The functional requirement is that we should “be able to shift ballast loading to re-establish stability before the condition becomes unrecoverable”. To start analyzing this situation, we need to estimate how often we will have a condition that can lead to such an unrecoverable situation if not correctly managed. Let us consider three such “initiating events”:

  • Loading operator error during routine ballasting (human error)
  • Damage to hull due to external impact
  • Error in load computer calculations

All of these situations depend on a number of factors. The probability that the loading operator will perform an erroneous operation depends on stress levels, competence/training and management factors. A thorough analysis using “human reliability analysis” can be performed, or a more simplified approach may be taken. We may, for example, make the assumption that the average operator makes 1 error without noticing it immediately every 100 years (this is an assumption – it must be validated if used).

Damage to the hull due to external impact would depend on the ship traffic density in the area, whether there is a difficult political situation (war, etc.), or whether you are operating in arctic environments where ice impact is likely (think Titanic). Again, you may do extensive analysis to establish such data, or make some assumptions based on expert judgment. For example, we may assume a penetrating ship collision every 100 years on average.

What about errors in load computer calculations? Do the operators trust the load computer blindly, or do they perform sanity checks? How was the load computer programmed? Is the software mature? Is the loading condition unusual? Many questions may be asked here as well. For the sake of this example, let us assume there is no contribution from the loading computer.

We are then looking at an average initiating event frequency of 0.01 per year for human errors and 0.01 per year for hull damage.

Then we should think about what our options for avoiding the accidental scenario are, given that one of the initiating events has already occurred. As “rig re-establishment” depends on the operator performing some action on the ballast system, the key to such barriers is making the operator aware of the situation. One natural way to do this would be to install an alarm indicating a dangerous ballast condition, and train the operator to respond. What is the reliability of this as a protection layer? The ballast function itself is what we are trying to set the integrity requirement for, and any response from the operator requires this system to work. Simply notifying the operator is thus necessary but not sufficient. If the ballast system fails when the operator tries to rectify the situation, the big question is: does the operator have a second option? Such an option may be a redundant ballast system that does not use the same components, to avoid common cause failure. In most situations the dynamics will be slow enough to permit manual operation of pumps and valves from local control panels, which is a redundant option if the operator is trained for it. If the alarm does not use the same components as the function itself, we have an independent protection layer. The reliability of this, put together with the required response of a well-trained operator, cannot be credited as better than a 90% success rate in a critical situation (ref. IEC 61511, for example).

So, based on this super-simplified analysis, are we achieving our required MTTF of 1 million years?

Events per year: 0.02.

Failure in IPL: Alarm + operator response using local control panels: 0.1.

OK, so we are achieving an MTTF of:

1/(0.02 x 0.1) = 500 years.
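
The same arithmetic as a small sketch (all numbers are the assumptions stated above, not validated data):

```python
# Super-simplified LOPA arithmetic for the ballast example.
# All frequencies and the single IPL credit are assumptions from the text above.
initiating_events = {"operator error": 0.01, "hull damage": 0.01}  # events per year
ipl_pfd = 0.1  # alarm + trained operator on local panels (max credit, cf. IEC 61511)

accident_frequency = sum(initiating_events.values()) * ipl_pfd  # per year
mttf_years = 1.0 / accident_frequency
print(f"{accident_frequency:.3f} events/year -> MTTF {mttf_years:.0f} years")  # 500 years
```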

This is pretty far from where we said we should be. First of all, this would require our ballast system to operate with better than SIL 4 performance (which is completely unrealistic), and furthermore, it includes the same operator again performing manual actions. Of course, considering how many ships are floating at sea and how few of them are sinking, this is probably a quite unrealistic picture of the real risk. Using super-simple tools for addressing complex accidental scenarios is probably not the best solution. For example, the hull penetration scenario itself has lots of complexity – penetrating a single compartment will not threaten global stability. Furthermore, the personnel will have time to analyze and act on the situation before it develops into an unrecoverable loss of stability – but the reliability of them doing so depends a lot on their training, competence and the installation’s leadership.

The take-away points from this short discussion are three:

  • Performance of ballast systems on ships is very good due to long history and robust designs
  • Setting performance requirements based on risk analysis requires a more in-depth view of the contributing factors (initiators and barriers)
  • Uncertainty in quantitative measures is very high, in part due to complexity and installation-specific factors; aiming for “generally accepted” technical standards is a good starting point.

Updating failure rates based on operational data – are we fooling ourselves again?

Failure rates for critical components are difficult to trust. Basically, if we look at public sources for data, such as the OREDA handbook, we observe that typical components have very wide confidence intervals for estimated failure rates, in spite of 30 years of collecting these data. If we look at the data supplied by vendors, they simply avoid saying anything about the spread or uncertainty in their data. Common practice today is to measure SIL compliance based on vendor-supplied data, after a sanity check by the analyst. The sanity check usually consists of comparing with other data sources and looking for completely ridiculous reliability claims.

In the operational phase it then becomes interesting to compare actual performance with the performance promised by the vendor. Typically, the actual performance is 10 to 100 times worse than promised. Because these components provide important parts of the barriers against terrible accidents, operators are also interested in measuring the actual integrity of these barriers. One possible method for doing this is given in the SINTEF report A8788, called “Guidelines for follow-up of safety instrumented systems (SIS) in operations”. For details, you should read that report, but here are the basics of what it recommends for updating failure rates:

  • If you have more than 3 million operating hours for a particular type of item, you can calculate the expected failure rate as “failures observed divided by number of hours of operation” and give a confidence interval based on the chi-square distribution (see the sketch after this list)
  • If you have less than 3 million operating hours but have observed dangerous undetected failures (most likely during testing), you can combine the a priori design failure rate with the operational experience.
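
A minimal sketch of the first rule, assuming a constant failure rate (homogeneous Poisson process); the example numbers are for illustration only:

```python
from scipy.stats import chi2

def failure_rate_with_ci(failures, hours, confidence=0.90):
    """Point estimate and two-sided chi-square confidence interval for a
    constant failure rate, given observed failures over aggregated hours."""
    alpha = 1.0 - confidence
    rate = failures / hours
    lower = chi2.ppf(alpha / 2, 2 * failures) / (2 * hours) if failures else 0.0
    upper = chi2.ppf(1 - alpha / 2, 2 * (failures + 1)) / (2 * hours)
    return rate, lower, upper

# Example: 4 DU failures observed over 5 million aggregated operating hours
print(failure_rate_with_ci(4, 5e6))  # ~(8.0e-07, 2.7e-07, 1.8e-06)
```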

Let’s look at how to combine failure rates. First, you have to give some conservative estimate of the “true failure rate” as a basis for the combination. This is used to say something about the uncertainty in the original estimate. From OREDA one can observe that the upper 90% confidence bound is often around 2 times the failure rate value, so if no better estimate is available, use this. For very low failure rate claims, use 5 × 10⁻⁷ per hour (lower than that seems too optimistic in most cases). Then calculate the following parameters:

where λ_DU-CE is the conservative estimate and λ_DU is the design failure rate. Then, the combined failure rate can be estimated as

where n is the number of similar items and t is the number of operational hours.
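
The exact formulas are given in the SINTEF report. As an illustration only, here is a sketch of one common way to implement such a combination – a Bayesian (gamma-Poisson) update where the prior mean is the design rate λ_DU and the conservative estimate λ_DU-CE is interpreted as a 90% percentile. That interpretation, and the percentile choice, are my assumptions, not necessarily the report’s formulas:

```python
from scipy import optimize, stats

def gamma_prior(lam_du, lam_du_ce, percentile=0.90):
    """Gamma prior with mean lam_du and the given percentile at lam_du_ce.
    (Assumed interpretation of the conservative estimate, for illustration.)"""
    def percentile_error(alpha):
        beta = alpha / lam_du  # enforces prior mean = alpha / beta = lam_du
        return stats.gamma.ppf(percentile, alpha, scale=1.0 / beta) - lam_du_ce
    alpha = optimize.brentq(percentile_error, 1e-3, 1e3)
    return alpha, alpha / lam_du

def combined_rate(lam_du, lam_du_ce, n_items, hours_per_item, du_failures):
    """Combine the design rate with operational experience (conjugate update)."""
    alpha, beta = gamma_prior(lam_du, lam_du_ce)
    return (alpha + du_failures) / (beta + n_items * hours_per_item)

# Example: design rate 5e-7/h, conservative estimate 2x the design rate,
# 75 items with 10 years of operation each, 2 DU failures observed.
# The result always lies between the design rate and the pure operational estimate.
print(combined_rate(5e-7, 1e-6, 75, 10 * 8760, 2))
```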

The SINTEF method does not give formulas for a confidence bound for the combined rate, but we may assume this will lie between zero and the conservative estimate (which does not tell us too much, really). For the rate based on pure operational data we can use the standard formulas above. Consider now a case with about 75 transmitters with a design failure rate of 5 × 10⁻⁷ failures per hour. Over a 30-year simulated operational period we would expect approximately 10 failures. Injecting 10 failures at random intervals yields interesting results in a simulated case, sketched below:
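
A sketch of such a simulation (the random seed and checkpoint spacing are arbitrary choices made for illustration):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)

n_items, lam_design, years = 75, 5e-7, 30            # case described above
total_hours = n_items * years * 8760                  # aggregated operating hours
n_failures = rng.poisson(lam_design * total_hours)    # ~10 failures expected
failure_times = np.sort(rng.uniform(0.0, total_hours, n_failures))

# Running estimate with 90% chi-square confidence bounds at ten checkpoints
for t in np.linspace(total_hours / 10, total_hours, 10):
    x = int(np.searchsorted(failure_times, t))        # failures observed so far
    rate = x / t
    lower = chi2.ppf(0.05, 2 * x) / (2 * t) if x else 0.0
    upper = chi2.ppf(0.95, 2 * (x + 1)) / (2 * t)
    print(f"{t / (n_items * 8760):4.1f} yrs: x={x:2d}, "
          f"rate={rate:.2e}, 90% CI=({lower:.2e}, {upper:.2e})")
```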

Note that up to 3 million operational hours we have assumed the design rate (PDS value) governs the uncertainty. Note also that for infrequent failures, the confidence bands and the estimated failure rate are heavily influenced by each individual failure observation. We should thus be very careful with updating operational practices directly based on only a few failure observations.

Clever phishing attempt from Nigerian scammers

This phishing e-mail landed in my work mailbox last week. This one was interesting as it was very professional and it was not obvious that it wasn’t the real thing. Here’s a snapshot of the e-mail itself:

Further, the PDF file was reasonably well formed:

Indicators that triggered suspicion of a scam:

a) I do not expect any shipment from DHL

b) Address is a DHL UK address (real), but the copyright is DHL International GmbH, which is actually not the correct entity even for Germany.

c) The PDF file is produced using a free converter tool, not a professional publishing tool, and the logo is low-resolution raster graphics (not visible unless enlarged)

d) The link “Here” leads to a non-DHL domain (odrillncm dot com) registered in 2015 to a user in Lagos, Nigeria… (found by a whois registry lookup)

Some quality signs:

a) Address and phone numbers for DHL in the UK are authentic

b) Good grammar and spelling, correct use of straplines, DHL corporate identity, etc.

c) Name of rep “David Blair” – a semi-known British TV producer, and a common name, making it hard to verify authenticity by googling or searching on LinkedIn/Facebook, etc.

The cues used to identify the scam are probably beyond the “average office PC user” level, and this is most likely an identity theft attempt. This sort of phishing is also used to target ICS environments, and this case is therefore interesting in that respect.

Solving the fragmentation problem in documentation of reliability

Reliability standards require that suppliers of components used as parts of a safety function or a safety instrumented system document full compliance with the reliability requirements. In practice, however, documentation is often severely lacking. In essence, the documentation required for a given component would include:

  • A description of how the component will be used in the safety function, the SIS, and which barrier functions it will support
  • A description of failure rate data and calculations to show that performance is satisfactory as measured against PFD or PFH requirements for the given SIL requirement on the function the component is a part of.
  • A description of systematic capability and under which architectures the component can be used for a given SIL requirement
  • A description of software assurance to satisfy relevant requirements
  • A description of quality management and how the vendor works to avoid systematic failures

Collecting pieces of fragmented reliability information can be a tedious and painful exercise – however, not using available information may be worse for the project as a whole than accepting that things are not going to be perfect.

In many cases, one or more pieces of this documentation are missing. However, the same component can be part of many deliverables; for example, a pressure transmitter may be part of various packages delivered by multiple package vendors. In some cases, these vendors deliver bits and pieces of relevant reliability documentation that, by chance, cover all of the relevant aspects. In that case, there is enough proof that the component can perform its function as part of the SIS, provided all relevant configurations are covered. In such cases, should we allow such fragmented documentation?

In principle, the answer would be “NO”. One reason for this is that traceability from requirement to tag number to vendor deliverable and vendor documentation will be lost. In practice, however, we are not left with much choice. If the component is acceptable to use, we should of course use it. Traceability is, however, important in reliability projects. The system integrator should thus make a summary of the documentation, with pointers to where each piece of documentation comes from. This solves the traceability problem. However, we should also take care to educate the entire value chain on the needed documentation, to ensure sufficient traceability and to allow assurance and verification activities without resorting to hunting for bits and pieces of fragmented information about each component. We should therefore put equal weight on:

  1. Ensuring our components are of sufficient quality and proven reliability for use in the SIS
  2. Influencing our value chain to focus on continuous improvement and correct documentation in projects

When «risk reduction» kills

School shootings are not all that uncommon. In America, that is. At the same time, lots of Americans believe owning a gun can improve their safety – they buy guns to shoot at bad guys. In fact, the likelihood that your weapon is going to kill you is larger than the probability that your gun is going to kill a terrorist, bank robber or rapist. By orders of magnitude. So…. How many Americans have been killed by terrorists the last 10 years? Almost none. How many have been killed in various shootings? A lot. There is some serious disconnect going on here. In fact, vox.com has created a nice graphic to illustrate this (original here: http://www.vox.com/2015/10/1/9437187/obama-guns-terrorism-deaths).

What went wrong here? People are trying to protect themselves against a very low-frequency event by using a very dangerous tool. Human error is a dominant error cause in the operation of any technology, guns included. Combined with a lack of risk understanding and emergency response training, putting a gun in every paranoid ghost’s hand is a recipe for disaster. Using drastic means to curb risks can be necessary, but then the risk should warrant it. If we look at a risk matrix for any individual and compare a couple of situations, we see that some risk-controlling measures are best left unused.

Case 1: getting killed by terrorist. Probability: extremely small. Countermeasure: own gun and shoot terrorist. Likelihood of success of this mitigation attempt: extremely small.

Case 2: getting killed by gunman. Probability: relatively small. Countermeasure: reduce number of guns in society. Likelihood of success of this mitigation attempt: almost certain.

Then a single question remains: how is it possible that people do not make this connection and therefore block legislation that would reduce the number of guns around? People are prioritizing a very inefficient measure against an extremely unlikely event, instead of a very efficient measure against a likely event.

Get results by running fewer but better meetings

I’ve been to a lot of meetings – they are the battleground of modern business. Meetings are also where we make decisions, drive progress and get our priorities aligned. Most meetings, however, are just terrible energy drains. Bad meetings are bad for people, and they harm quality. It is not hard to argue that bad meetings are also bad for safety, if the workshops and meetings used to drive risk assessments and engineering activities are not well organized with a clear focus. Based on experience from a decade in meeting rooms, I’ve devised the following 5 rules of great meetings that I think are truly helpful.

Meetings can vary a lot in format and location - selecting the architecture of your meeting carefully is one of the rules for driving great meeting experiences.

Meeting Rule #1 If your meeting does not have a clear purpose, a specific agenda, and defined desired outcomes, the meeting shall not take place.

Meeting Rule #2 Carefully select attendees and share the purpose and agenda of your meeting with the attendees in advance, asking for feedback. Continue to foster debate and two-way interactions in the meeting.

Meeting Rule #3 Adapt architecture of meetings to the purpose, agenda and size of the meeting, by carefully selecting visual aids, meeting locations, duration and formality to fit your needs.

Meeting Rule #4 Stay close to the agenda to show that you value results, and at the same time give praise where praise is due both during the meeting and in the minutes. Make sure you make it very clear when you and your team have reached a desired outcome in your meeting.

Meeting Rule #5 Never invite to a meeting to drive outcomes you do not feel OK with from an ethical standpoint.

I’m very interested to hear what you think about these rules, and whether you have other heuristics for making meetings work. I’m sure not all meetings I lead are great, but they are probably much better after I realized the few things summarized in these rules than they used to be before. Tell me what you think in the comment field, or on Twitter (@sjefersuper).

Securing your control systems – what are your priorities?

Information security focuses on three aspects of safeguarding data in our systems (CIA):

  • Confidentiality: data should only be visible to those who have been granted access to them
  • Integrity: data should not be altered by people not authorized to do so
  • Availability: data should be available to all users and systems that need them, when the data is needed

In traditional IT, security thinking has been dominated by confidentiality. This is in most cases justified; the data itself is the valuable asset (think credit card information, medical journals, police records, accounting, business plans, etc.). In control systems, the real value lies in the physical process being controlled by the control system assets, so availability is extremely important, as is integrity. Confidentiality, on the other hand, may be less important.

Many organizations plan their security management based on traditional IT priorities, and apply these priorities also in the control system domain. This way, there may be a misalignment between the real priorities of the organization and where the money and resources are spent.

Dr. Eric Cole, a renowned security expert, recommends asking senior management for these priorities, and then comparing them with the actual security expenditure from the last year – if there is a misalignment between “what’s important” and “what’s done”, it is time to take action. Have you thought through whether your organization is spending the money where it is most needed to safeguard what is truly critical?

5 days of ICS & SCADA security in Amsterdam

This week I have been at SANS ICS Security Amsterdam, taking one of their courses on security for industrial control systems. This has been a fantastic opportunity to learn new things, reinforce known concepts at a deeper level, and to network and meet a wide range of people with interests in this field. I’ve been surprised to see people from law enforcement and national security, industry representatives, consultants and vendors coming together at one event. Information security has been in the news a lot lately, and is seen as a big part of the risk picture in almost every region and every industry.

Any security training and seminar needs its own T-shirt. And soft drinks.

Two things in particular have been very interesting to see in the material presented by the SANS course instructors:

  1. Security researchers continue to find basic vulnerabilities in new product lines from major vendors (the big companies we’ve all heard about, I’m not going to shame anyone)
  2. A lot of control systems are still facing the internet, are directly accessible with no or very weak security, and attacks are prevalent as found in honeypot research experiments

Basically, this confirms the notion that “the situation is bad and we need to do something about it”. People said this after Stuxnet, and they are still saying it. My impression from working with various clients is that industry is aware of the risks that exist “out there”, but they are to varying degrees doing something to control that risk. Too many still believe that “we will never be compromised as long as we have a firewall”. Relating to this, one might ask: what are the “basic vulnerabilities”, and how do we work around them?

Many control system components today run on commodity operating systems, or are connected to servers running MS Windows or Linux, e.g. used to display HMIs. These HMIs are in many modern systems developed as web apps (running on local servers) for portability, ease of access, etc. This means that many of the vulnerabilities found in regular IT and on the web also apply to control systems. However, these risks are worse in the control system world, because these systems need to run all the time and can therefore often not be patched, and should someone break in, they could cause real physical damage (think crashing cars, blowing up an oil rig or destroying a melting furnace). Some of the top vulnerabilities we are exposed to are the following: buffer overflows (yes, still – lots of stuff runs on old systems), SQL injection vulnerabilities and cross-site scripting vulnerabilities (web interfaces…). So, if we cannot patch, what can we do about this?

First of all, perform a risk and vulnerability assessment, taking both the possible scenarios and the credibility of those scenarios into account. Make sure to establish a good baseline security policy and use it for managing these issues – there is lots of guidance available, often sector specific. If you cannot patch, focus on what you can do: ensure everyone involved in purchasing, maintaining, producing and using control systems is aware of the risks, and of what types of behavior are good and what types are bad. This means that security awareness must be built into the organizational culture.

On the technical side, maybe especially with lots of legacy systems running, make sure the network architecture is reasonable and safe – avoid having critical assets directly facing the internet (do a Shodan search and you will find that lots of asset owners are not following good practice here). The architecture must weigh risks against business needs – a full lockdown may be the safest way to go, but it may also stop core business functions from working.

Further, should a breach occur, make sure you have the organizational and technical capabilities to deal with that. Plan and train on incident response – and remember you are not alone. Get help from vendors both in managing the assets during normal operations, and during a crisis situation. Including incident response in service agreements may thus be a good idea.

This was a quick summary of topics we’ve looked at during training, and discussed over beers in Amsterdam. The training by SANS has been excellent, and I’m looking forward to bringing reinforced and new insights back to the office on Monday.

What does the IEC 61508 requirement to have a safety management system mean for vendors?

All companies involved in the safety lifecycle are required to have a safety management system, according to IEC 61508. What the safety management process entails for a specific project is relatively clear from the standard, and is typically described in an overall functional safety management plan. It is, however, much less clear from the standard what is expected of a vendor producing a component that is used in a SIS, but that is a generic product rather than a system specifically designed for one particular application.

For vendors, the safety management system should be extensive enough to support fulfillment of all four aspects of the SIL requirement the component is targeting:

  • Quantitative requirements (PFD/PFH)
  • Semi-quantitative and architectural requirements (HWFT, SFF, etc.)
  • Software requirements
  • Qualitative requirements (quality system, avoidance of systematic failures)

A great safety management system is tailored to maintain the safety integrity level capability of the product from all four perspectives. Maintaining this integrity requires a high-reliability organization, as well as knowledgeable individuals.

Quite often, system integrators and system owners experience challenges working with vendors. We’ve discussed this in previous posts, e.g. follow-up of vendors. Based on experience from several sides of the table, the following parts of a safety management system are found to be essential:

  • A good system for receiving feedback and using experience data to improve the product
  • Clear role descriptions, competence requirements and a training system to make sure all employees are qualified for their roles
  • A good change management system, ensuring impact of changes is looked at from several angles
  • A quality system that ensures continuous improvement can occur, and that such processes are documented
  • A documentation system that ensures the capabilities of the product can be documented in a trusted way, taking all changes into account in a transparent manner

A vendor that has such systems in place will have a much greater chance of delivering top quality products than a vendor that only focuses on the technology itself. Ultra-reliable products require great organizations to stay ultra-reliable throughout the entire lifecycle.

Profiling of hackers presented at ESREL 2015

Yesterday my LR colleague Anders presented our work on aggressor profiling for use in security analysis at the European Safety and Reliability Conference (ESREL) in Zürich. The approach attracted a lot of interest, also from people not working with security. One of the big challenges in integrating security assessments into existing risk management frameworks is how to work with the notion of probability or likelihood when considering infosec risks. Basically, we don’t know how to quantify the probability of a given scenario in a reasonable manner – so how can we then risk assess it and treat it in a rational manner?

The approach presented looks exactly at this. A typical risk management process would involve risk identification, analysis and evaluation of consequences and likelihoods, planning of mitigation, and follow-up/stakeholder involvement. We have found, working with clients, that people find identifying the potential consequences of different scenarios much easier than judging the credibility of those scenarios. The approach to assessing credibility is centered on two questions:

  1. Who is the victim of the crime?
  2. Who is the aggressor?

Given a certain victim, with its financial standing, relationships to other organizations, geopolitical factors, etc., we can form an opinion about who would have any motivation to try to attack the asset. Possible categories of such attackers may be:

  • Script kiddies
  • Hacktivists
  • Other corporations
  • Nation states
  • Terrorists
  • Rogue insiders

Each of these stereotypes would have different traits and triggers shaping the credibility of an attack from them. This is related to motivation or intent, their resources and stamina, their skill sets, and the cost-benefit ratio as seen from the bad guy’s perspective. Giving scores to these different traits and triggers can help establish an opinion of how credible a threat is.

An interesting effect in security is that the likelihood of a threat scenario is not necessarily decoupled from the consequence of the scenario; the motivation of the perpetrator may be reinforced by the potential gains of great damage. This should be kept in mind during considerations of intent and cost-benefit.

Forming structured opinions about this allows us to sort threat scenarios not only according to consequences, but also according to credibility. That fits into standard risk management frameworks. Somewhat simplified, we can make a matrix to sort the different threat scenarios into “acceptable”, “should be looked at” and “unacceptable”, as sketched below.
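
A hypothetical sketch of how such scoring and sorting could be implemented (the trait names, the 1–5 scale, the equal weighting and the thresholds are all invented for illustration, not taken from the paper):

```python
# Hypothetical scoring of aggressor traits and sorting of threat scenarios.
# Scale, weights and thresholds are illustrative assumptions only.
def credibility_score(intent, resources, skill, cost_benefit):
    """Equal-weight average of 1-5 trait scores; higher means more credible."""
    return (intent + resources + skill + cost_benefit) / 4.0

def sort_scenario(credibility, consequence, low=2.0, high=3.5):
    """Place a scenario in a simple matrix category based on two 1-5 scores."""
    if credibility >= high and consequence >= high:
        return "unacceptable"
    if max(credibility, consequence) >= low:
        return "should be looked at"
    return "acceptable"

# Example: hacktivist scenario against a highly visible asset owner
score = credibility_score(intent=4, resources=2, skill=3, cost_benefit=4)
print(score, "->", sort_scenario(score, consequence=4))  # 3.25 -> should be looked at
```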