Updating failure rates based on operational data – are we fooling ourselves again?

Failure rates for critical components are difficult to trust. If we look at public data sources, such as the OREDA handbook, we observe that typical components have very wide confidence intervals for their estimated failure rates, in spite of 30 years of collecting these data. If we look at the data supplied by vendors, they simply avoid saying anything about the spread or uncertainty in their data. Common practice today is to measure SIL compliance based on vendor-supplied data, after a sanity check by the analyst. The sanity check usually consists of comparing with other data sources and looking for completely ridiculous reliability claims. Coming into the operational phase, it then becomes interesting to compare actual performance with the performance the vendor promised. Typically, the actual performance is 10 to 100 times worse than promised. Because these components form important parts of the barriers against terrible accidents, operators are also interested in measuring the actual integrity of these barriers. One possible method for doing this is given in the SINTEF report A8788, “Guidelines for follow-up of safety instrumented systems (SIS) in operations”. For details you should read that report, but here are the basics of what it recommends for updating failure rates:

  • If you have more than 3 million operating hours for a particular type of item, you can calculate the expected failure rate as “failures observed divided by number of hours of operation” and give a confidence interval based on the chi-square distribution.
  • If you have less than 3 million operating hours but have observed dangerous undetected (DU) failures (most likely during testing), you can combine the a priori design failure rate with the operational data.
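The chi-square interval in the first bullet can be sketched in a few lines of Python. This is my own sketch, not the SINTEF tooling: it uses the Wilson–Hilferty approximation to the chi-square quantile so that only the standard library is needed (normally you would use an exact quantile function, e.g. from a statistics package).

```python
from statistics import NormalDist

def chi2_quantile(p: float, df: float) -> float:
    """Wilson-Hilferty approximation to the chi-square p-quantile."""
    z = NormalDist().inv_cdf(p)
    return df * (1 - 2 / (9 * df) + z * (2 / (9 * df)) ** 0.5) ** 3

def failure_rate_ci(failures: int, hours: float, confidence: float = 0.90):
    """Point estimate and two-sided confidence interval for a failure rate."""
    alpha = 1 - confidence
    estimate = failures / hours
    lower = chi2_quantile(alpha / 2, 2 * failures) / (2 * hours)
    upper = chi2_quantile(1 - alpha / 2, 2 * (failures + 1)) / (2 * hours)
    return lower, estimate, upper

# Example: 10 observed failures over 75 transmitters x 30 years of operation
hours = 75 * 30 * 8760
lo, est, hi = failure_rate_ci(10, hours)
print(f"{lo:.2e} < lambda = {est:.2e} < {hi:.2e}  (per hour)")
```

Note that with zero observed failures the lower bound degenerates (df = 0), and only the upper bound remains meaningful.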

Let’s look at how to combine failure rates. First, you have to give a conservative estimate of the “true failure rate” as a basis for the combination. This is used to say something about the uncertainty in the original estimate. From OREDA one can observe that the upper 90% confidence bound is often around 2 times the failure rate value, so if no better estimate is available, use this. For very low failure rate claims, use 5 × 10⁻⁷ per hour (anything lower seems too optimistic in most cases). Then calculate the following parameters:

where λDU-CE is the conservative estimate and λDU is the design failure rate. Then, the combined failure rate can be estimated as

where n is the number of similar items and t is the number of operational hours.
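The report’s exact formulas are not reproduced here, but the combination can be sketched as a Bayesian update with a gamma prior: anchor the prior so that its mean equals the design rate λDU and its 90th percentile roughly equals the conservative estimate λDU-CE (via a normal approximation), then let the posterior mean blend the prior with the observed failures. The parameter construction below is my own assumption, mimicking but not identical to the SINTEF A8788 formulas.

```python
def combined_failure_rate(lambda_du, lambda_du_ce, failures, n_items, hours_per_item):
    """Sketch of a gamma-prior Bayesian update of a design failure rate.

    ASSUMPTION: prior mean is lambda_du, and its 90th percentile is
    approximated (normally) as mean + 1.28 * sd, matched to lambda_du_ce.
    This mimics, but is not identical to, the SINTEF A8788 method.
    """
    sd = (lambda_du_ce - lambda_du) / 1.28   # implied prior standard deviation
    alpha = (lambda_du / sd) ** 2            # gamma shape: (mean / sd)^2
    beta = lambda_du / sd ** 2               # gamma rate: mean / sd^2 (hours)
    t = n_items * hours_per_item             # aggregated operational hours
    return (alpha + failures) / (beta + t)   # posterior mean

# 75 transmitters, 10 DU failures over 30 years, design rate 5e-7 per hour,
# conservative estimate taken as twice the design rate
rate = combined_failure_rate(5e-7, 1e-6, 10, 75, 30 * 8760)
print(f"combined rate: {rate:.2e} per hour")
```

With these numbers the operational data roughly confirm the design rate, so the combined estimate stays close to 5 × 10⁻⁷ per hour; with zero observed failures the same update would pull the estimate well below the design value.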

The SINTEF method does not give formulas for a confidence bound for the combined rate, but we may assume it will lie between zero and the conservative estimate (which does not tell us much, really). For the rate based on pure operational data we can use the standard formulas. Consider now a case with 75 transmitters with a design failure rate of 5 × 10⁻⁷ failures per hour. Over a 30-year operational period we would expect approximately 10 failures. Injecting 10 failures at random intervals yields interesting results in a simulated case:
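A simulation along these lines can be sketched as follows: draw 10 failure times uniformly over the 30-year window and track how the naive operational estimate (failures so far divided by aggregated fleet hours so far) jumps at each observation. This is my own illustrative sketch, not the figure from the original post.

```python
import random

random.seed(42)  # reproducible sketch

N_ITEMS = 75
TOTAL_HOURS = 30 * 8760   # calendar hours per item over 30 years
FAILURES = 10

# Draw failure times (calendar hours) uniformly over the operating period
failure_times = sorted(random.uniform(0, TOTAL_HOURS) for _ in range(FAILURES))

# Running estimate after each observed failure: k failures over the
# aggregated hours of N_ITEMS items running in parallel up to time t
for k, t in enumerate(failure_times, start=1):
    aggregated_hours = N_ITEMS * t
    print(f"failure {k:2d} at year {t / 8760:5.1f}: "
          f"lambda-hat = {k / aggregated_hours:.2e} per hour")
```

The early estimates swing wildly with each new observation, which is exactly the behaviour that makes small failure counts so treacherous.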

Note that up to 3 million operational hours we have assumed the design rate (PDS value) governs the uncertainty. Note also that for infrequent failures, the confidence bands and the estimated failure rate are heavily influenced by each individual failure observation. We should thus be very careful with updating operational practices directly based on just a few failure observations.

Clever phishing attempt from Nigerian scammers

This phishing e-mail landed in my work mailbox last week. This one was interesting as it was very professional and it was not obvious that it wasn’t the real thing. Here’s a snapshot of the e-mail itself:

Further, the PDF file was reasonably well formed:

Indicators that triggered the notion of a scam:

a) I do not expect any shipment from DHL

b) The address is a real DHL UK address, but the copyright is DHL International GmbH, which is actually not the correct entity even for Germany.

c) The PDF file is produced using a free converter tool, not a professional publishing tool, and the logo is low-resolution raster graphics (not visible unless enlarged)

d) The link “Here” leads to a non-DHL domain (odrillncm dot com) registered in 2015 to a user in Lagos, Nigeria… (found by a whois registry lookup)

Some quality signs:

a) The address and phone numbers for DHL in the UK are authentic

b) Good grammar and spelling, correct use of straplines, DHL corporate identity, etc.

c) The name of the rep, “David Blair” – a semi-known British TV producer and a common name, making it hard to verify authenticity by googling or searching on LinkedIn/Facebook, etc.

The cues used to identify the scam are probably beyond the “average office PC user” level, and this is most likely an identity theft attempt. This sort of phishing is also used to target ICS environments, which makes this case interesting in that respect.

Solving the fragmentation problem in documentation of reliability

Reliability standards require that components used as parts of a safety function or a safety instrumented system (SIS) be documented to show full compliance with the reliability requirements. In practice, however, documentation is often severely lacking. In essence, the documentation required for a given component would include:

  • A description of how the component will be used in the safety function, the SIS, and which barrier functions it will support
  • A description of failure rate data and calculations to show that performance is satisfactory as measured against PFD or PFH requirements for the given SIL requirement on the function the component is a part of.
  • A description of systematic capability and under which architectures the component can be used for a given SIL requirement
  • A description of software assurance to satisfy relevant requirements
  • A description of quality management and how the vendor works to avoid systematic failures
Collecting pieces of fragmented reliability information can be a tedious and painful exercise – however, not using available information may be worse for the project as a whole than accepting that things are not going to be perfect.
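The second bullet above can be made concrete with the simplified IEC 61508 formula for a single (1oo1) component in low-demand mode, PFDavg ≈ λDU · τ / 2, where τ is the proof-test interval. The sketch below checks an example rate against the standard low-demand PFDavg bands; the failure rate is a made-up example number.

```python
def pfd_avg_1oo1(lambda_du: float, test_interval_hours: float) -> float:
    """Simplified average probability of failure on demand, 1oo1 component."""
    return lambda_du * test_interval_hours / 2

def sil_from_pfd(pfd: float) -> int:
    """Map a PFDavg to the IEC 61508 low-demand SIL band (0 = below SIL 1)."""
    for sil, upper in ((4, 1e-4), (3, 1e-3), (2, 1e-2), (1, 1e-1)):
        if pfd < upper:
            return sil
    return 0

# A transmitter with lambda_DU = 5e-7 per hour, proof-tested yearly
pfd = pfd_avg_1oo1(5e-7, 8760)
print(f"PFDavg = {pfd:.2e} -> quantitatively within the SIL {sil_from_pfd(pfd)} band")
```

Of course, the quantitative band is only one of the aspects the documentation must cover – architectural constraints, software and systematic capability can all cap the achievable SIL below this number.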

In many cases one or more pieces of this documentation are missing. However, the same component can be part of many deliverables; for example, a pressure transmitter may be part of various packages delivered by multiple package vendors. In some cases, these vendors deliver bits and pieces of relevant reliability documentation that, by chance, together cover all of the relevant aspects. In that situation there is enough proof that the component can perform its function as part of the SIS, provided all relevant configurations are covered. Should we allow such fragmented documentation?

In principle, the answer would be “NO”. One reason for this is that traceability from requirement to tag number to vendor deliverable and vendor documentation is lost. In practice, however, we are not left with much choice: if the component is acceptable to use, we should of course use it. Traceability is, however, important in reliability projects. The system integrator should thus make a summary of the documentation with pointers to where each piece of documentation comes from. This solves the traceability problem. We should also take care to educate the entire value chain on the needed documentation, to ensure sufficient traceability and to allow for assurance and verification activities without resorting to hunting for bits and pieces of fragmented information about each component. We should therefore put equal weight on:

  1. Ensuring our components are of sufficient quality and proven reliability for use in the SIS
  2. Influencing our value chain to focus on continuous improvement and correct documentation in projects

When «risk reduction» kills

School shootings are not all that uncommon. In America, that is. At the same time, lots of Americans believe owning a gun improves their safety – they buy guns to shoot at bad guys. In fact, the likelihood that your own weapon is going to kill you is larger than the probability that your gun is going to kill a terrorist, bank robber or rapist. By orders of magnitude. So… how many Americans have been killed by terrorists in the last 10 years? Almost none. How many have been killed in various shootings? A lot. There is some serious disconnect going on here. vox.com has created a nice graphic to illustrate this (original here: http://www.vox.com/2015/10/1/9437187/obama-guns-terrorism-deaths).

What went wrong here? People are trying to protect themselves against a very low-frequency event by using a very dangerous tool. Human error is a dominant cause of failure in the operation of any technology, guns included. Combined with lack of risk understanding and emergency response training, putting a gun in every paranoid ghost’s hand is a recipe for disaster. Using drastic means to curb risks can be necessary, but then the risk should warrant it. If we look at a risk matrix for any individual and compare a couple of situations, we see that some risk-controlling measures are best left unused.

Case 1: getting killed by a terrorist. Probability: extremely small. Countermeasure: own a gun and shoot the terrorist. Likelihood of success of this mitigation attempt: extremely small.

Case 2: getting killed by a gunman. Probability: relatively small. Countermeasure: reduce the number of guns in society. Likelihood of success of this mitigation attempt: almost certain.

Then a single question remains: how is it possible that people do not make this connection and therefore block legislation that would reduce the number of guns around? People are prioritizing a very inefficient measure against an extremely unlikely event, instead of a very efficient measure against a likely event.

Get results by running fewer but better meetings

I’ve been to a lot of meetings – it is the battle ground of modern business. It is also where we make decisions, drive progress and get our priorities aligned. Most meetings, however, are just terrible energy drains. Bad meetings are bad for people, and they harm quality. It is not hard to claim that bad meetings are also bad for safety, if the workshops and meetings used to drive risk assessments and engineering activities are not well organized with a clear focus. Based on experience from a decade in meeting rooms, I’ve devised the following 5 rules of great meetings that I think are truly helpful.

Meetings can vary a lot in format and location - selecting the architecture of your meeting carefully is one of the rules for driving great meeting experiences.

Meeting Rule #1 If your meeting does not have a clear purpose, a specific agenda, and defined desired outcomes, the meeting shall not take place.

Meeting Rule #2 Carefully select attendees and share the purpose and agenda of your meeting with the attendees in advance, asking for feedback. Continue to foster debate and two-way interactions in the meeting.

Meeting Rule #3 Adapt architecture of meetings to the purpose, agenda and size of the meeting, by carefully selecting visual aids, meeting locations, duration and formality to fit your needs.

Meeting Rule #4 Stay close to the agenda to show that you value results, and at the same time give praise where praise is due both during the meeting and in the minutes. Make sure you make it very clear when you and your team have reached a desired outcome in your meeting.

Meeting Rule #5 Never invite to a meeting to drive outcomes you do not feel OK with from an ethical standpoint.

I’m very interested to hear what you think about these rules, and whether you have other heuristics for making meetings work. I’m sure not all meetings I lead are great, but they are probably much better after I realized the few things summarized in these rules than they used to be. Tell me what you think in the comment field, or on Twitter (@sjefersuper).

Securing your control systems – what are your priorities?

Information security focuses on three aspects of safeguarding data in our systems (CIA):

  • Confidentiality: data should only be visible to those who have been granted access to them
  • Integrity: data should not be altered by people not authorized to do so
  • Availability: data should be available to the people and systems that need them, when the data is needed

In traditional IT, security thinking has been dominated by confidentiality. This is in most cases justified; the data itself is the valuable asset (think credit card information, medical journals, police records, accounting, business plans, etc.). In control systems, the real value lies in the physical process governed by the control system assets; availability is extremely important, as is integrity. Confidentiality, on the other hand, may be less important.

Many organizations plan their security management based on traditional IT priorities, and apply these priorities also in the control system domain. As a result, there may be a misalignment between the real priorities of the organization and where the money and resources are spent.

Dr. Eric Cole, a renowned security expert, recommends asking senior management for these priorities, and then comparing with the actual security expenditure from the last year – if there is misalignment between “what’s important” and “what’s done”, it is time to take action. Have you thought through whether your organization is spending the money where it is most needed to safeguard what is truly critical?

5 days of ICS & SCADA security in Amsterdam

This week I have been at SANS ICS Security Amsterdam, taking one of their courses on security for industrial control systems. This has been a fantastic opportunity to learn new things, reinforce known concepts at a deeper level, and to network with a wide range of people with interests in this field. I’ve been surprised to see people from law enforcement, national security, industry, consultancies and vendors coming together at one event. Information security has been in the news a lot lately, and is seen as a big part of the risk picture in almost every region and every industry.

Any security training and seminar needs its own T-shirt. And soft drinks.

Two things in particular have been very interesting to see when it comes to the stuff presented by the SANS course instructors:

  1. Security researchers continue to find basic vulnerabilities in new product lines from major vendors (the big companies we’ve all heard about, I’m not going to shame anyone)
  2. A lot of control systems are still facing the internet, are directly accessible with no or very weak security, and attacks are prevalent as found in honeypot research experiments

Basically, this confirms the notion that “the situation is bad and we need to do something about it”. People said this after Stuxnet, and they are still saying it. My impression from working with various clients is that industry is aware of the risks that exist “out there”, but to varying degrees they are doing something to control those risks. Too many still believe that “we will never be compromised as long as we have a firewall”. Relating to this, one might ask: what are the “basic vulnerabilities”, and how do we work around them?

Many control system components today run on commodity OSs, or are connected to servers running MS Windows or Linux, e.g. used to display HMIs. In many modern systems these HMIs are developed as web apps (running on local servers) for portability, ease of access, etc. This means that many of the vulnerabilities found in regular IT and on the web also apply to control systems. However, these risks are worse in the control system world, because these systems need to run all the time and can therefore often not be patched, and should someone break in, they could cause real physical damage (think crashing cars, blowing up an oil rig or destroying a melting furnace). Some of the top vulnerabilities we are exposed to are the following: buffer overflows (yes, still – lots of stuff runs on old systems), SQL injection vulnerabilities and cross-site scripting vulnerabilities (web interfaces…). So, if we cannot patch, what can we do about this?
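To make the SQL injection point concrete, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for an HMI’s local database (the table and tag names are made up): a string-built query lets a crafted “tag name” smuggle in extra SQL, while a parameterized query treats the same input as plain data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (name TEXT, value REAL)")
conn.execute("INSERT INTO tags VALUES ('PT-101', 4.2)")

malicious = "x' OR '1'='1"  # crafted "tag name" submitted via a web HMI form

# Vulnerable: the input is pasted into the SQL text itself,
# so the OR clause is parsed as SQL and matches every row
rows_bad = conn.execute(
    f"SELECT * FROM tags WHERE name = '{malicious}'").fetchall()

# Safe: the input is bound as a parameter and never parsed as SQL
rows_ok = conn.execute(
    "SELECT * FROM tags WHERE name = ?", (malicious,)).fetchall()

print(len(rows_bad), len(rows_ok))  # → 1 0
```

The fix costs nothing at runtime, which is why parameterized queries are the standard recommendation even for legacy web front-ends that cannot otherwise be patched.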

First of all, perform a risk and vulnerability assessment, taking both the possible scenarios and their credibility into account. Make sure to establish a good baseline security policy and use it for managing these issues – there is lots of guidance available, often sector-specific. If you cannot patch, focus on what you can do: ensure everyone involved in purchasing, maintaining, producing and using control systems is aware of the risks and of what types of behavior are good and bad. This means that security awareness must be built into the organizational culture.

On the technical side, especially with lots of legacy systems running, make sure the network architecture is reasonable and safe – avoid having critical assets directly facing the internet (do a Shodan search and you will find that many asset owners are not following good practice here). The architecture must weigh risks against business needs – a full lockdown may be the safest way to go, but it may also stop core business functions from working.

Further, should a breach occur, make sure you have the organizational and technical capabilities to deal with it. Plan and train for incident response – and remember you are not alone. Get help from vendors both in managing the assets during normal operations and during a crisis. Including incident response in service agreements may thus be a good idea.

This was a quick summary of topics we’ve looked at during training, and discussed over beers in Amsterdam. The training by SANS has been excellent, and I’m looking forward to bringing reinforced and new insights back to the office on Monday.

What does the IEC 61508 requirement to have a safety management system mean for vendors?

All companies involved in the safety lifecycle are required to have a safety management system, according to IEC 61508. What the safety management process entails for a specific project is relatively clear from the standard, and is typically described in an overall functional safety management plan. It is, however, much less clear from the standard what is expected of a vendor producing a component that is used in a SIS, but that is a generic product rather than a system designed specifically for one particular situation.

For vendors, the safety management system should be extensive enough to support fulfillment of all four aspects of the SIL requirement the component is targeting:

  • Quantitative requirements (PFD/PFH)
  • Semi-quantitative and architectural requirements (HWFT, SFF, etc.)
  • Software requirements
  • Qualitative requirements (quality system, avoidance of systematic failures)
A great safety management system is tailored to maintain the safety integrity level capability of the product from all four perspectives. Maintaining this integrity requires a high-reliability organization, as well as knowledgeable individuals.
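The semi-quantitative bullet above can be illustrated with the safe failure fraction, SFF = (λS + λDD) / (λS + λDD + λDU), i.e. the share of failures that are either safe or dangerous-but-detected. The rates in the sketch below are made-up example numbers, not data for any real component.

```python
def safe_failure_fraction(lambda_s: float, lambda_dd: float, lambda_du: float) -> float:
    """SFF per IEC 61508: the fraction of failures that are safe or detected."""
    total = lambda_s + lambda_dd + lambda_du
    return (lambda_s + lambda_dd) / total

# Hypothetical component, failure rates in failures per hour:
# safe 2e-6, dangerous detected 1.5e-6, dangerous undetected 5e-7
sff = safe_failure_fraction(lambda_s=2e-6, lambda_dd=1.5e-6, lambda_du=5e-7)
print(f"SFF = {sff:.1%}")
```

Together with the hardware fault tolerance of the architecture, this fraction determines which SIL the architectural constraints of the standard allow the component to claim.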

Quite often, system integrators and system owners experience challenges working with vendors. We’ve discussed this in previous posts, e.g. follow-up of vendors. Based on experience from several sides of the table, the following parts of a safety management system are found to be essential:

  • A good system for receiving feedback and using experience data to improve the product
  • Clear role descriptions, competence requirements and a training system to make sure all employees are qualified for their roles
  • A good change management system, ensuring impact of changes is looked at from several angles
  • A quality system that ensures continuous improvement can occur, and that such processes are documented
  • A documentation system that ensures the capabilities of the product can be documented in a trusted way, taking all changes into account in a transparent manner

A vendor that has such systems in place will have a much greater chance of delivering top-quality products than a vendor that focuses only on the technology itself. Ultra-reliable products require great organizations to stay ultra-reliable throughout the entire lifecycle.

Profiling of hackers presented at ESREL 2015

Yesterday my LR colleague Anders presented our work on aggressor profiling for use in security analysis at the European risk and reliability conference (ESREL) in Zürich. The approach attracted a lot of interest, including from people not working with security. One of the big challenges in integrating security assessments into existing risk management frameworks is how to work with the notion of probability or likelihood when considering infosec risks. Basically, we don’t know how to quantify the probability of a given scenario in a reasonable manner – so how can we then assess and treat the risk in a rational manner?

The approach presented looks exactly at this. A typical risk management process involves risk identification, analysis and evaluation of consequences and likelihoods, planning of mitigation, and follow-up/stakeholder involvement. Working with clients, we have found that people find identifying the potential consequences of different scenarios much easier than assessing the credibility of those scenarios. The approach to assessing credibility is centered on two actors:

  1. Who is the victim of the crime?
  2. Who is the aggressor?

Given a certain victim with its financial standing, relationships to other organizations, geopolitical factors, etc., we can form an opinion about who would have any motivation to try to attack the asset. Possible categories of such attackers may be:

  • Script kiddies
  • Hacktivists
  • Other corporations
  • Nation states
  • Terrorists
  • Rogue insiders

Each of these stereotypes would have different traits and triggers shaping the credibility of an attack from them. This is related to motivation or intent, their resources and stamina, their skill sets and the cost-benefit ratio as seen from the bad guy perspective. Giving scores to these different traits and triggers can help establish the opinion of how credible a threat is.

An interesting effect in security is that the likelihood of a threat scenario is not necessarily decoupled from the consequence of the scenario; the motivation of the perpetrator may be reinforced by the potential gains of great damage. This should be kept in mind during considerations of intent and cost-benefit.

Forming structured opinions about this allows us to sort threat scenarios not only according to consequences, but also according to credibility. That fits into standard risk management frameworks. Somewhat simplified, we can make a matrix to sort the different threat scenarios into “acceptable”, “should be looked at” and “unacceptable”.
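A minimal sketch of such a scoring-and-sorting scheme could look like the following. The trait names, the 1–5 scale, the equal weighting and the matrix thresholds are illustrative assumptions of mine, not the method presented at ESREL.

```python
def credibility_score(motivation: int, resources: int, skills: int,
                      cost_benefit: int) -> float:
    """Average of trait scores (each 1-5) as a crude credibility measure."""
    return (motivation + resources + skills + cost_benefit) / 4

def risk_class(credibility: float, consequence: float) -> str:
    """Sort a scenario into the simplified three-level matrix (scores 1-5)."""
    if credibility >= 4 and consequence >= 4:
        return "unacceptable"
    if credibility >= 3 or consequence >= 3:
        return "should be looked at"
    return "acceptable"

# Hypothetical scenario: hacktivists defacing a corporate website
cred = credibility_score(motivation=4, resources=2, skills=3, cost_benefit=4)
print(risk_class(cred, consequence=2))  # → should be looked at
```

The point is not the particular numbers but the structure: scoring traits per stereotype makes the credibility judgment explicit and repeatable, instead of a gut feeling buried in a workshop.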

Managing information security should be a natural part of your risk management system

Anyone not hiding in a hole in the ground for the last five years must have noticed how much the media writes about cyber threats. The media picture aligns well with the impression I get when talking to clients: business risks for production-centered firms are no longer dominated only by random accidents, but also by potential cyber threats. This means that such threats need to be managed.

Most companies already have a risk management system in place, typically along the lines of ISO 31000. A risk management process must consist of a cycle of activities:

  • Identify risks
  • Evaluate risks against acceptance criteria for severity and probability
  • Plan mitigating actions
  • Involve internal and external stakeholders along the way

Exactly the same process can be used to deal with information security risks. A good methodology for managing such security risks should take into account who the threat actors are in the context of the business. Why would they want to attack? Do they have a lot of resources and know-how? What is the cost-benefit ratio for the bad guys? Understanding the human factors at play will arm you with a reasonable way to rank the credibility of various attack scenarios. This helps fit cyber risks into the typical risk management process: resources should be spent where they bring the largest risk reduction. Typically, highly credible attack scenarios with terrible consequences should be dealt with first, less likely or less severe scenarios second, and scenarios that seem extremely unlikely or have little or no impact on our business are perhaps OK to accept. This sort of risk ranking is well known to risk management professionals. A few ways in which assessing the credibility of attack scenarios differs from assessing more random events are the following:

  • A bad guy may be motivated by worse consequences; the probability of an attack is thus not decoupled from the consequence
  • It is hard to use “probabilities” or “frequencies” in a meaningful sense – a qualitative approach may be just as useful (sometimes true also for random risks!)

Mitigations should thus be planned according to risk reduction needs and the effects of the mitigation approaches. This is exactly what we do in other risk management settings. Communication with stakeholders is equally necessary in this context, if not more so. Risk owners, equipment suppliers, users, other supply chain partners – they can all be affected. In our connected world, and increasingly so as we move to smarter production systems, cross-infection may be possible across various domains and interfaces. The tools of communication well known to risk managers are therefore equally important when managing risks to production-critical information systems.

Managing infosec requires a keen eye for the context of the business, and adaptability to new realities is a key success factor

The bottom line is: don’t outsource responsibility for risk management to your IT services provider – integrate cyber risk management into your existing risk management process. That is the only way to be in control of your own environment.