IEC 61511 Security – getting the right detail level

When performing the risk and vulnerability assessment required by the new IEC 61511 standard, make sure the level of detail is right for your application. Normally the system integrator operates at the architectural level, meaning signal validation in software components should probably already have been dealt with. On the other hand, upgrading and maintaining the system throughout the lifecycle has to be looked into. Hitting the right depth is hard: digging too deep is costly, while staying too shallow does not support your decision making. Planning the depth of the security assessment should therefore be a priority from the very beginning!

Starting with the context – having the end in mind

The purpose of including cybersecurity requirements in a safety instrumented system design is to make sure the reliability of the system is not threatened by security incidents. That reliability requires each safety instrumented function (SIF) to perform its intended task at the right moment; we are concerned with the availability and the integrity of the system.

 

The probability of failure on demand for a safety critical function usually depends on random error distributions and testing regimes. How can hacker threats be included in the thinking around reliability engineering? The goal is to remain confident in the reliability calculations, so that quantitative risk calculations are still meaningful.
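To make that starting point concrete, here is a minimal sketch in Python, using invented numbers, of the classic low-demand approximation PFDavg ≈ λDU × τ / 2 for a single channel, where λDU is the dangerous undetected failure rate and τ is the proof test interval. Security threats appear nowhere in this formula – which is exactly why the assessment discussed in this post is needed.

# Minimal sketch: low-demand PFD_avg for a single (1oo1) channel.
# Assumes a constant dangerous-undetected failure rate and perfect proof tests.
# The numbers are illustrative only, not taken from a certified failure rate database.

def pfd_avg_1oo1(lambda_du_per_hour: float, proof_test_interval_hours: float) -> float:
    """Classic approximation: PFD_avg ~ lambda_DU * tau / 2."""
    return lambda_du_per_hour * proof_test_interval_hours / 2.0

lambda_du = 2.0e-7   # dangerous undetected failures per hour (illustrative)
tau = 8760.0         # proof test once per year
print(f"PFD_avg = {pfd_avg_1oo1(lambda_du, tau):.2e}")   # about 8.8e-4, i.e. in the SIL 3 band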

 

In order to understand the threats to your system you need to start with the company and its place in the world, and in the supply chain. What does the company do? Consider an oil producer active in a global upstream market – producing offshore, onshore, as well as from unconventional sources such as tar sands, arctic fields and shale oil. The company is also investing heavily in Iraq, including areas recently captured from ISIS. Furthermore, on the owner side of this company you find a Russian oligarch, known to be close to the Kremlin, as the majority shareholder. The firm is listed on the Hong Kong stock exchange. Its key suppliers are Chinese engineering firms and steel producers, and its top customers are also Chinese government-backed companies. How does all of this affect the threat landscape as it applies to this firm?

The firm is involved in activities that may trigger the interest of hacktivists:

  • Unconventional oil production
  • Arctic oil production

It also operates in one of the most politically unstable regions in the world, where the world’s largest military powers have to some degree opposing interests; this could draw the attention of both terrorist groups and nation state hackers. It is also worth noting that the company is on good terms with both the Russian and Chinese governments, two countries often accused of using state sponsored hackers to target companies in the west. The largest nation state threat to this oil company may thus be from western countries, including the one headed by Donald Trump. He has been quite silent on cybersecurity after taking office but issued statements during his 2016 campaign hinting at a more aggressive build-up of offensive capacities. The company should thus at least expect the interest of script kiddies, hacktivists, cybercriminals, terrorists, nation states and insiders. These groups have quite varying capacities, and the SIS is typically hard to get at due to multiple firewalls and network segregation. Our main focus should thus be on hacktivists, terrorists and nation states – with cybercriminals and insiders acting as proxies (knowingly or not).

The end in mind: keeping safety-critical systems reliable also under attack, or at least ensuring that attacks make only an insignificant contribution to unreliability.

Granularity of security assessment

The goal of this discussion is to find the right depth level for risk and vulnerability assessments under IEC 61511. If we start with the threat actors and their capabilities, some interesting issues emerge:

  • Nation states: capable of injecting unknown features into firmware and application software at the production stage, including through human infiltration of engineering teams. In some countries such “features” may even be sanctioned by the producer. Actual operations can include cyberphysical incursions with real asset destruction.
  • Terrorists: infiltration of vendors is less likely. Typical capabilities are APTs using phishing to breach the attack surface, and availability attacks through DDoS provided the SIS can be reached. Physical attack is also highly likely.
  • Cybercriminals: similar to terrorists, but may also have more advanced capabilities. Can also act out of own interest, e.g. through extortion schemes.
  • Hacktivists: unlikely to threaten firmware and software integrity. Not likely to desire asset damage, as that can easily lead to pollution, which conflicts with their likely motivations. DDoS attacks can be expected, but the SIS is usually not exposed to them.

Some of these actors have serious capabilities, and it is possible that they will be used if the political climate warrants it. As we are most likely relying on procured systems from established vendors, using limited variability languages for the SIS, we have little influence over the low-level software engineering. Configurations, choice of function blocks and any inclusion of custom-designed software blocks are another story. Our assessment should thus, at a minimum, include the following aspects:

  • Procurement – setting security requirements and general information security requirements, and managing the follow-up process and cross-organizational competence management.
  • Software components – criticality assessment. Extra testing requirements to vendors. Risk assessment including configuration items.
  • Architectural security – network segregation, attack surface exposure, monitoring, security technologies, responsible organizations and network operations
  • Hardware – tampering risk, exposure to physical attacks, ports and access points, network access points including wireless (VSAT, microwave, GSM, WiFi)
  • Organizational security risks: project organization, operations organization. Review of roles and responsibilities, criticality of key personnel, workload aspects, contractual interfaces, third-party personnel.

Summary

This post does not give a general procedure for deciding the depth of analysis, but it does outline important factors. Always start with the context to judge both impact and the expected actions of threat actors. Use this to determine the capabilities of the main threat actors, which will help you decide the granularity level of your assessment. The things that are outside of your control should not be neglected either, but treated as sources of uncertainty that may influence the security controls you need to put in place.

 

A sketch of key factors to include when deciding on the granularity for a cybersecurity risk assessment under IEC 61511

Thinking about risk through methods

Risk management is a topic with a large number of methods. Within the process industries, semi-quantitative methods are popular, in particular for determining the required SIL for safety instrumented functions (automatic shutdowns, etc.). Two common approaches are known as LOPA, short for “layers of protection analysis”, and Riskgraph. These methods are sometimes treated as “holy” by practitioners, but the truth is that they are merely cognitive aids for sorting through our thinking about risks.

 

Riskgraph #sliderule – methods are formalisms. See picture on Instagram

 

In short, the risk assessment process consists of a series of steps (a minimal sketch in code follows the list):

  • Identify risk scenarios
  • Find out what risk-reducing measures you already have in place, such as design features and procedures
  • Determine what the potential consequences of the scenario at hand are, e.g. worker fatalities or a major environmental disaster
  • Make an estimate of how likely or credible you think it is that the risk scenario will occur
  • Consider how much you trust the existing barriers to do the job
  • Determine how trustworthy your new barrier must be for the situation to be acceptable
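
Here is the minimal sketch promised above: the steps expressed as simple LOPA-style bookkeeping in Python. The scenario, frequencies and barrier credits are invented purely for illustration and would have to come from your own analysis.

# Sketch of the steps above as simple LOPA-style bookkeeping.
# All names and numbers are illustrative assumptions, not real project data.

from dataclasses import dataclass, field

@dataclass
class Barrier:
    name: str
    pfd: float                   # probability of failure on demand credited to the barrier

@dataclass
class RiskScenario:
    description: str             # step 1: identify the scenario
    consequence: str             # step 3: potential consequence
    initiating_frequency: float  # step 4: initiating events per year (statistics or judgment)
    existing_barriers: list[Barrier] = field(default_factory=list)   # steps 2 and 5

    def mitigated_frequency(self) -> float:
        """Frequency of the consequence with the existing barriers credited (input to step 6)."""
        freq = self.initiating_frequency
        for barrier in self.existing_barriers:
            freq *= barrier.pfd
        return freq

scenario = RiskScenario(
    description="Falling asleep while driving",
    consequence="Head-on collision with a truck",
    initiating_frequency=0.01,    # assumed: once per 100 driver-years
    existing_barriers=[Barrier("Rumble strips + lane departure warning", pfd=0.1)],
)
print(f"Mitigated frequency: {scenario.mitigated_frequency():.1e} per year")

Comparing the mitigated frequency to what you consider acceptable gives the risk reduction, and thus the trustworthiness, your new barrier must deliver – which is the last step in the list.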

Several of these bullet points can be very difficult tasks alone, and putting together a risk picture that allows you to make sane decisions is hard work. That’s why we lean on methods, to help us make sense of the mess that discussions about risk typically lead to.

Consequences can be hard to gauge, and one bad situation may lead to a set of different outcomes. Think about the risk of “falling asleep while driving a car”. Both of these are valid consequences that may occur:

  • You drive off the road and crash in the ditch – moderate to serious injuries
  • You steer the car into the wrong lane and crash head-on with a truck – instant death

Should you think about both, or pick one of them, or another consequence not on this list? In many “barrier design” cases the designer chooses to design for the worst-case credible consequence. It may be difficult to judge what is really credible, and what is truly the worst case. And is this approach sound if the worst case is credible but still quite unlikely, while at the same time you have relatively likely scenarios with less serious outcomes? If you use a method like LOPA or Riskgraph, you may very well have a statement in your method description to always use the worst-case consequence. A bit of judgment and common sense is still a good idea.

Another difficult topic is probability, or credibility. How likely is it that an initiating event will occur, and what is the initiating event in the first place? If you are the driver of the car, is “falling asleep behind the wheel” the initiating event? Let’s say it is. You can definitely find statistics on how often people fall asleep behind the wheel. The key question is, are they applicable to the situation at hand? Are data from other countries applicable? Maybe not, if they have different road standards, different requirements for getting a driver’s license, etc. Personal or local factors can also influence the probability. In the case of the driver falling asleep, the probability would be influenced by his or her health, stress levels, maintenance of the car, etc. The bottom line is that the estimate of probability will be a judgment call in most cases. If you are lucky enough to have statistical data to lean on, make sure you validate that the data are representative for your situation. Good method descriptions should also give guidance on how to make these judgment calls.

Most risks you identify already have some risk-reducing barrier elements. These can be things like alarms and operating procedures, and other means to reduce the likelihood or consequence of escalation of the scenario. Determining how much you are willing to rely on these other barriers is key to setting a requirement on your safety function of interest – typically a SIL rating. Standards limit how much credit you can take for certain types of safeguards, but also here there will be some judgment involved. Key questions are listed below, followed by a sketch of how the resulting numbers translate into a SIL target:

  • Are multiple safeguards really independent, such that the same type of failure cannot knock out multiple defenses at once?
  • How much trust can you put in each safeguard?
  • Are there situations where the safeguards are less trustworthy, e.g. if there are only summer interns available to handle a serious situation that requires experience and leadership?
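
As promised, a small sketch of how the answers to these questions end up as a SIL target: divide the tolerable event frequency by the frequency that remains after the safeguards you are willing to credit, and map the resulting required PFD onto the standard low-demand SIL bands. All figures are assumptions for illustration.

# Sketch: from tolerable frequency and credited safeguards to a required PFD and SIL band.
# The frequencies and PFD credits below are assumptions, not recommendations.

def required_pfd(tolerable_frequency: float,
                 initiating_frequency: float,
                 credited_ipl_pfds: list[float]) -> float:
    """PFD the new safety function must meet so that
    initiating_frequency * product(IPL PFDs) * PFD_new <= tolerable_frequency."""
    mitigated = initiating_frequency
    for pfd in credited_ipl_pfds:
        mitigated *= pfd
    return tolerable_frequency / mitigated

def sil_band(pfd: float) -> str:
    """Low-demand PFD_avg bands from IEC 61508 / IEC 61511."""
    if 1e-5 <= pfd < 1e-4:
        return "SIL 4"
    if 1e-4 <= pfd < 1e-3:
        return "SIL 3"
    if 1e-3 <= pfd < 1e-2:
        return "SIL 2"
    if 1e-2 <= pfd < 1e-1:
        return "SIL 1"
    return "outside the SIL 1-4 bands"

pfd_needed = required_pfd(tolerable_frequency=1e-4,  # assumed tolerable events per year
                          initiating_frequency=0.2,  # assumed initiating events per year
                          credited_ipl_pfds=[0.1])   # one credited alarm + operator response
print(f"Required PFD: {pfd_needed:.0e} -> {sil_band(pfd_needed)}")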

Risk assessment methods are helpful, but remember that you make a lot of assumptions when you use them. Question those assumptions even when you use a recognized method, especially if somebody’s life will depend on your decision.

SIL and ballast systems

Working on floating oil and gas facilities, one question keeps popping up about ballast systems: should they have SIL requirements, and if so, what should the requirements be? When seeking to establish requirements for such systems, several issues are uncovered. First of all, current designs of ballast systems are very robust due to the evolution of designs and requirements in shipping over a long time. Further, the problem is much more complex than collecting a few well-defined failure modes with random error data leading to a given situation, as typically seen in many process industry problem descriptions. This complexity depends on a number of factors, some of them specific to each ship or installation, such as location, ship traffic density or the operating practices of personnel onboard. Therefore, any quantitative estimates of “error probabilities” contributing to an expected return frequency of critical events for the system will have significant uncertainties associated with them.


A ballast system is used to maintain the stability of a ship or a floating hull structure under varying cargo loading conditions and in various sea conditions and ship drafts. Water is kept in tanks dispersed around the hull structure, and can be pumped in or out, or transferred between tanks, to maintain stability. Errors in ballasting operations can lead to loss of stability, which in the worst case means a sunken ship. Ballasting is normally a semi-manual operation where a marine operator uses a loading computer to guide decisions about ballasting, and manually gives commands to a computer-based control system on where to transfer water into or out of a particular ballast tank. Because this is such a critical safety system, it is natural to ask: “what are the performance requirements?”.

Ballast systems have been part of shipping for hundreds of years. Requirements for ballast systems are thus set in the classification rules of ship classification societies, such as Lloyd’s Register, DNV GL or ABS. These requirements are typically prescriptive in nature and focus on robustness and the avoidance of common cause failures in the technology. Maritime classification societies do not refer to safety integrity levels but rely on other means of ensuring safe operation and reliability. Society has accepted this practice for years, for very diverse vessels ranging from oil tankers to passenger cruise ships.

In oil and gas operations, the use of safety integrity levels to establish performance requirements for instrumented safety functions is the norm, and standards such as IEC 61508 are used as the point of reference. The Norwegian Oil and Gas Association has issued a guideline that is normally applied for installations in Norwegian waters, which offers a simplification of requirement setting based on “typical performance”. This guideline can be freely downloaded from this web page. It states that for “start of ballasting for rig re-establishment”, the system should conform to a SIL 1 requirement. The “system” is described as consisting of a ballast control node, 2 x 100% pumps and three ballast valves. In appendix A.12 of the guideline a description of this “sub-function” is given, with a calculation of achievable performance.

It may be argued that this functional description is somewhat artificial, because the ballast system on a production installation is normally operated more or less continuously. The function is defined for a single ballast tank/compartment, irrespective of the number of tanks and the load balancing necessary for re-establishing stability. The Guideline 070 approach is based on “typical performance” of the safety system as it is defined, and is not linked directly to the risk reduction required from the system. Multiple approaches may be taken to assign safety integrity levels based on risk analysis, see for example IEC 61508. One method that is particularly common in the process industries and the oil and gas industry is “layers of protection analysis”, or LOPA for short. In this type of study, multiple initiating events can contribute to one hazard situation, for example “sunken ship due to loss of stability”, and multiple barriers or “independent protection layers” can be credited for reducing the risk of this hazard being realized.

In order to use a risk based method for setting the integrity requirement, it is necessary to define an acceptable frequency for this event. Let us say, for the sake of the discussion, that it is acceptable that the mean time between each “sunken ship due to loss of stability” is 1 million years. How can we reason about this to establish requirements for the ballast system? The functional requirement is that we should “be able to shift ballast loading to re-establish stability before the condition becomes unrecoverable”. To start analyzing this situation, we need to estimate how often we will have a condition that can lead to such an unrecoverable situation if not correctly managed. Let us consider three such “initiating events”:

  • Loading operator error during routine ballasting (human error)
  • Damage to hull due to external impact
  • Error in load computer calculations

All of these situations depend on a number of factors. The probability that the loading operator will make an erroneous ballasting operation depends on stress levels, competence/training and management factors. A thorough analysis using “human reliability analysis” can be performed, or a more simplified approach may be taken. We may, for example, make the assumption that the average operator makes 1 error without noticing immediately every 100 years (this is an assumption – it must be validated if used).

Damage to the hull due to external impact depends on the ship traffic density in the area, whether there is a difficult political situation (war, etc.), or whether you are operating in arctic environments where ice impact is likely (think Titanic). Again, you may do extensive analysis to establish such data, or make some assumptions based on expert judgment. For example, we may assume a penetrating ship collision every 100 years on average.

What about errors in load computer calculations? Do the operators trust the load computer blindly, or do they perform sanity checks? How was the load computer programmed? Is the software mature? Is the loading condition unusual? Many questions may be asked here as well. For the sake of this example, let us assume there is no contribution from the loading computer.

We are then looking at an average initiating event frequency of 0.01 per year for human errors and 0.01 per year for hull damage.

Then we should think about what our options for avoiding the accidental scenario are, given that one of the initiating events has already occurred. As “rig re-establishment” depends on the operator performing some action on the ballast system, key to such barriers is making the operator aware of the situation. One natural way to do this is to install an alarm indicating a dangerous ballast condition, and train the operator to respond. What is the reliability of this as a protection layer? The ballast function itself is what we are trying to set the integrity requirement for, and any response by the operator requires this system to work. Simply notifying the operator is thus necessary but not sufficient. If the ballast system fails when the operator tries to rectify the situation, the big question is: does the operator have a second option? Such options may be a redundant ballast system that does not use the same components, to avoid common cause failure. In most situations the dynamics will be slow enough to permit manual operation of pumps and valves from local control panels. This is a redundant option if the operator is trained for it. If the alarm does not use the same components as the function itself, we have an independent protection layer. The reliability of this, put together with the required response of a well-trained operator, cannot be credited as better than a 90% success rate in a critical situation (ref. IEC 61511, for example).

So, based on this super-simplified analysis, are we achieving our required MTTF of 1 million years?

Events per year: 0.02.

Failure in IPL: Alarm + operator response using local control panels: 0.1.

OK, so we are achieving an MTTF of:

1/(0.02 x 0.1) = 500 years.
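
The same arithmetic as a short script, using the assumed figures from this post; it also shows how large the remaining risk-reduction gap to the one-in-a-million-years target is, which is what the discussion below is about.

# The super-simplified ballast LOPA from the text, as a short script.
# All figures are the stated assumptions, not validated data.

initiating_frequency = 0.01 + 0.01   # per year: operator error + hull damage
ipl_pfd = 0.1                        # alarm + trained operator using local control panels

scenario_frequency = initiating_frequency * ipl_pfd
print(f"MTTF with the credited IPL: {1.0 / scenario_frequency:.0f} years")        # 500 years

target_mttf_years = 1_000_000
gap = target_mttf_years * scenario_frequency
print(f"Remaining risk reduction needed to reach the target: factor {gap:.0f}")  # 2000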

This is pretty far from where we said we should be. First of all, this would require our ballast system to operate with better than SIL 4 performance (which is completely unrealistic), and furthermore, it includes the same operator again performing manual actions. Of course, considering how many ships are floating at sea and how few of them are sinking, this is probably a quite unrealistic picture of the real risk. Using super-simple tools for addressing complex accidental scenarios is probably not the best solution. For example, the hull penetration scenario itself has lots of complexity – penetrating a single compartment will not threaten global stability. Furthermore, the personnel will have time to analyze and act on the situation before it develops into an unrecoverable loss of stability – but the reliability of them doing so depends a lot on their training, competence and the installation’s leadership.

There are three take-away points from this short discussion:

  • Performance of ballast systems on ships is very good due to long history and robust designs
  • Setting performance requirements based on risk analysis requires a more in-depth view of the contributing factors (initiators and barriers)
  • Uncertainty in quantitative measures is very high, partly due to complexity and installation-specific factors, so aiming for “generally accepted” technical standards is a good starting point.

What does the IEC 61508 requirement to have a safety management system mean for vendors?

All companies involved in the safety lifecycle are required to have a safety management system, according to IEC 61508. What the safety management process entails for a specific project is relatively clear from the standard, and is typically described in an overall functional safety management plan. It is, however, much less clear from the standard what is expected of a vendor producing a component that is used in a SIS, but that is a generic product rather than a system designed for one particular application.

For vendors, the safety management system should be extensive enough to support fulfillment of all four aspects of the SIL requirement the component is targeting:

  • Quantitative requirements (PFD/PFH)
  • Semi-quantitative and architectural requirements (HWFT, SFF, etc.)
  • Software requirements
  • Qualitative requirements (quality system, avoidance of systematic failures)

A great safety management system is tailored to maintain the safety integrity level capability of the product from all four perspectives. Maintaining this integrity requires a high-reliability organization, as well as knowledgeable individuals.

Quite often, system integrators and system owners experience challenges working with vendors. We’ve discussed this in previous posts, e.g. follow-up of vendors. Based on experience from several sides of the table, the following parts of a safety management system are found to be essential:

  • A good system for receiving feedback and using experience data to improve the product
  • Clear role descriptions, competence requirements and a training system to make sure all employees are qualified for their roles
  • A good change management system, ensuring impact of changes is looked at from several angles
  • A quality system that ensures continuous improvement can occur, and that such processes are documented
  • A documentation system that ensures the capabilities of the product can be documented in a trusted way, taking all changes into account in a transparent manner

A vendor that has such systems in place will have a much greater chance of delivering top quality products than a vendor that only focuses on the technology itself. Ultra-reliable products require great organizations to stay ultra-reliable throughout the entire lifecycle.

Why functional safety audits are useful

Functional safety work usually involves a lot of people, and multiple organizations. One key success factor for design and operation of safety instrumented systems is the competence of the people involved in the safety lifecycle. In practice, when activities have been omitted, or the quality of the work is not acceptable, this is discovered in the functional safety assessment towards the end of the project, or worse, it is not discovered at all. The result of too low quality is lower integrity of the SIS as a barrier, and thus higher risk to people, assets and environment – without the asset owner being aware of this! Obviously a bad situation.

Knowing how to do your job is important in all phases of the safety lifecycle. Functional safety audits can be an important tool for verification – and for motivating organizations to maintain good competence management systems for all relevant roles.

Competence management is a key part of functional safety management. In spite of this, many companies have less than desirable track records in this field. This may be due to ignorance, or maybe because some organizations view a «SIL» as a marketing designation rather than a real risk reduction measure. Either way – such a situation is unacceptable. One key tool for ensuring everybody involved understands what their responsibilities are, and makes an effort to learn what they need to know to actually secure the necessary system level integrity, is the use of functional safety audits. An auditing program should be available in all functional safety projects, with at least the following aspects:

  • A procedure for functional safety audits should exist
  • An internal auditing program should exist within each company involved in the safety lifecycle
  • Vendor auditing should be used to make sure suppliers are complying with functional safety requirements
  • All auditing programs should include aspects related to document control, management of change and competence management

Constructive auditing can be an invaluable part of building a positive organizational culture – one where quality becomes important to every function involved in the value chain, from the sales rep to the R&D engineer.

One day, hopefully, statements like “please take the chapter on competence out of the management plan, we don’t want any difficult questions about systems we do not have” will seem like an impossible absurdity.

Do you trust your numbers?

People tend to rely more on numbers than on other types of «proof» of goodness. Where those numbers come from seems to play less of a role. Of course, a number picked out of thin air is just as worthless as a Greek government bond – but why do we then seem to trust a promise, as long as someone has put a number on it? Many people have discussed this in many settings before, but one of my favorites is this blog post at the American Mathematical Society from 2012 by Jean Joseph.

This tendency of “everything is fine because the numbers say so” thinking is very much present in functional safety; an overfocus on probability calculations is common. I believe there are several reasons for this. First, engineers like quantitative measures – and there are good and sound methodologies for performing reliability calculations. We tend to trust what numbers say more than qualitative information that we perceive to be less accurate.

A SIL requirement consists of four types of requirements, the practical implications of which depend on the integrity level sought. The four types of requirements are described below.

The quantitative requirements are probability calculations. We tend to overfocus on these at the expense of the others. The quality of these calculations depends on the quality of the input data (failure rates) – and the quality of such data can be very hard to verify.

Semi-quantitative requirements are in most cases expressed as the required redundancy (hardware fault tolerance) and the safe failure fraction. To build the necessary robustness into a safety function, redundancy is required to ensure a single failure does not lead to a dangerous failure of the safety function. The required redundancy depends on the SIL of the function, as well as the fraction of failures that are either safe or detected by diagnostics (the so-called safe failure fraction, SFF). In practice, we see somewhat less focus on this than on the probability calculations themselves (PFD).
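
As a reminder of what feeds these semi-quantitative figures, here is a small sketch of the safe failure fraction calculated from the three failure rate categories (safe, dangerous detected, dangerous undetected); the rates are invented for illustration and would normally come from an FMEDA.

# Sketch: safe failure fraction (SFF) from failure rate categories.
# SFF = (lambda_S + lambda_DD) / (lambda_S + lambda_DD + lambda_DU)
# The failure rates below are illustrative, not results from a real FMEDA.

def safe_failure_fraction(lambda_s: float, lambda_dd: float, lambda_du: float) -> float:
    total = lambda_s + lambda_dd + lambda_du
    return (lambda_s + lambda_dd) / total

sff = safe_failure_fraction(lambda_s=3.0e-6,   # safe failures per hour
                            lambda_dd=1.5e-6,  # dangerous failures detected by diagnostics
                            lambda_du=5.0e-7)  # dangerous undetected failures
print(f"SFF = {sff:.1%}")   # 90.0% – together with hardware fault tolerance this drives the achievable SIL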

Software requirements depend on the required SIL and the type of software development involved. Software competence among system users and system integrators is typically lower than their hardware competence. This causes software requirement setting and compliance assessment to be delegated to the software vendor without much oversight from the integrator or user. In such cases this is a competence-based weakness in the lifecycle that we cannot capture in the numbers we calculate.

Qualitative requirements include how we work with the SIS development process itself, including managing changes, and ensuring systematic errors are not introduced. An important part of this work and the requirements we need to meet is to ensure that personnel competent for their roles perform all activities.

If we are going to trust the probabilities calculated, we need to trust that the right level of redundancy exists. We need to trust that software developers create their code in a way that makes the existence of bugs with potential dangerous outcomes very unlikely. We need to trust that everybody involved in the SIS development has the right level of competence and experience, and that the organizations involved have systems in place to properly manage the development process and all its requirements. A simple probability estimate does not tell us much, unless it is born in the context of a properly managed SIS development process.

How independent should your FSA leader be?

Functional safety assessment is a mandatory 3rd party review/audit for functional safety work, and is required by most reliability standards. In line with good auditing practice, the FSA leader should be independent of the project development. Exactly what does this mean? Practice varies from company to company, from sector to sector and even from project to project. It seems reasonable to require a greater degree of independence for projects where the risks managed through the SIS are more serious. IEC 61511 requires (Clause 5.2.6.1.2) that functional safety assessments are conducted with “at least one senior competent person not involved in the project design team”. In a note to this clause the standard remarks that the planner should consider the independence of the assessment team (among other things). This is hardly conclusive.

If we go to the mother standard IEC 61508, requirements are slightly more explicit, as given by Clause 8.2.15 of IEC 61508-1:2010, which states that the level of independence shall be linked to perceived consequence class and required SILs. For major accident hazards, two categories are used in IEC 61508:

  • Class C: death to several people
  • Class D: very many people killed

For class C the standard accepts the use of an FSA team from an “independent department”, whereas for class D only an “independent organization” is acceptable. Further, also for class C, an independent organization should be used if the degree of complexity is high, the design is novel, or the design organization lacks experience with this particular type of design. There are also requirements based on systematic capability in terms of SIL, but in the context of industrial processes those are normally less stringent than the consequence-based requirements for FSA team independence. The standard also specifies that compliance with sector specific standards, such as IEC 61511, would make a different basis for the consideration of independence acceptable.

In this context, the definitions of “independent department” and “independent organization” are given in Part 4 of the standard. An independent department is separate and distinct from the departments responsible for activities that take place during the specified phase of the overall system or software lifecycle subject to the validation activity. This also means that the line managers of those departments should not be the same person. An independent organization is separated by management and other resources from the organizations responsible for activities that take place during the lifecycle phase. In practice, this means that the organization leading a HAZOP or LOPA should not perform the FSA for the same project if there are potential major accident hazards within the scope, and preferably not if there are any significant fatal accident risks in the project either. Considering the requirement of separate management and resource access, it is not a non-conformity if two different legal entities within the same corporate structure perform the different activities, provided they have separate budgets and leadership teams.

If we consider another sector specific standard, EN 50129 for safety-related electronic systems in the European railway sector, we see that similar independence requirements exist for third-party validation activities. Figure 6 in that standard seemingly allows the assessor to be part of the same organization as one involved in the system development, but further requires for this situation that the assessor has an authorization from the national safety authority, is completely independent from the project team and reports directly to the safety authorities. In practice the independent assessor is in most cases from an independent organization.

It is thus highly recommended to have an FSA team from a separate organization for all major SIS developments intended to handle serious risks to personnel; this is in line with common auditing practice in other fields.

Why is this important? Because we are all humans. If we feel ownership of a certain process or product, or affiliation with an organization, it will inevitably be more difficult for us to point out what is not so good. We do not want to hurt people we work with by stating that their work is not good enough – even if we know that inferior quality in a safety instrumented system may actually lead to workers getting killed later. If we look to another field with the same type of challenges but potentially more guidance on independence, we can refer to the Sarbanes-Oxley Act of 2002 from the United States. The SEC has issued guidelines about auditor independence and what should be assessed. Specifically, they include:

  1. Will a relationship with the auditor create a mutual or conflicting interest with their audit client?
  2. Will the relationship place the auditor in the position of auditing his/her own work?
  3. Will the relationship result in the auditor acting as management or an employee of the audit client?
  4. Will the relationship result in the auditor being put in a position where he/she will act as an advocate for the audit client?

It would be prudent to consider at least these questions before using an organization that is already involved in the lifecycle phase subject to the FSA.