Why functional safety audits are useful

Functional safety work usually involves a lot of people, and multiple organizations. One key success factor for design and operation of safety instrumented systems is the competence of the people involved in the safety lifecycle. In practice, when activities have been omitted, or the quality of the work is not acceptable, this is discovered in the functional safety assessment towards the end of the project, or worse, it is not discovered at all. The result of too low quality is lower integrity of the SIS as a barrier, and thus higher risk to people, assets and environment – without the asset owner being aware of this! Obviously a bad situation.

Knowing how to do your job is important in all phases of the safety lifecycle. Functional safety audits can be an important tool for verification – and for motivating organizations to maintain good competence management systems for all relevant roles.

Competence management is a key part of functional safety management. In spite of this, many companies have less than desirable track records in this field. This may be due to ignorance, or maybe because some organizations view a «SIL» as a marketing designation rather than a real risk reduction measure. Either way – such a situation is unacceptable. One key tool for ensuring everybody involved understands what their responsibilities are, and makes an effort to learn what they need to know to actually secure the necessary system level integrity, is the use of functional safety audits. An auditing program should be available in all functional safety projects, with at least the following aspects:

  • A procedure for functional safety audits should exist
  • An internal auditing program should exist within each company involved in the safety lifecycle
  • Vendor auditing should be used to make sure suppliers are complying with functional safety requirements
  • All auditing programs should include aspects related to document control, management of change and competence management

Constructive auditing can be an invaluable part of building a positive organizational culture – where quality becomes important to every function involved in the value chain, from the sales rep to the R&D engineer.

One day statements like “please take the chapter on competence out of the management plan, we don’t want any difficult questions about systems we do not have” may seem like an impossible absurdity.

How does your approach to maintenance change the reliability of your system?

Everybody knows that you need to maintain your stuff to make sure it is in top condition; a colloquial way of saying that maintenance is a necessary component of any highly reliable system. In practice, there are three quite common approaches to maintenance of technical systems:

  • If it ain’t broke don’t fix it
  • Periodic inspection and maintenance
  • Risk based maintenance

The first policy is the most commonly applied in private life. The only reason this is at all possible is a combination of high labor cost, relatively cheap mass production of many things, and that insurance companies allow you to do it by paying for stuff that “accidentally” breaks. For reliability, this policy is clearly the least reasonable, and it will negatively affect the reliability of your system. The following picture shows an example of the degradation that may occur when following this policy – taken from a walk in the Danish countryside.

The second policy is maybe the most commonly applied in industry – and for conscientious car owners. You take your car to the mechanic once per year, or perhaps every 25000 km, for an oil change and a check-up. You test the most important things, and follow up if necessary. This ensures your reliability is not negatively affected by lack of maintenance – but still the work has to be performed with high quality. Maintenance done right reduces the probability of severe failures; done wrong, it is outright dangerous.

Risk based maintenance uses available data to judge the risk level for your equipment – by using a combination of information gained from inspections and direct measurements. When you actively monitor the risk of failures and plan maintenance activities in accordance with real risks, you can prolong the useful life of your system while maintaining the necessary reliability.

When assessing the reliability of critical systems you should thus include both the philosophy of maintenance and the ability of the organization to execute the maintenance plan in a good way in your assessment. A simple illustration of this is shown in the below picture.

Follow-up of the supply chain for SIL rated equipment

Procurement is easy – getting exactly what you need is not. I have previously discussed the challenges related to follow-up of suppliers of SIL rated equipment on this blog, but that was from the perspective of an organization. This time, let’s look at what this means for you, if you are either the purchaser or the package engineer responsible for the deliverable. Basically there are three challenges related to communication in procurement of SIL rated equipment – or procurement of anything for that matter:

  • The purchaser does not understand what the project needs
  • The customer does not understand what the purchaser needs
  • The package engineer does not know that the purchaser does not know what the project needs, and therefore he or she also does not know that the supplier does not know what the project actually needs

This, of course, is a recipe for a lot of quarreling and time wasted on finger pointing and the blame game. All of this is expensive, frustrating and useless. What can we do to avoid this problem in the first place? First, everybody needs to know a few basic things about SIL. The standards used in industry are quite heavy reading, and when guidelines for your industry are available, it is a good idea to use them. For the oil and gas industry, the Norwegian Oil & Gas Association’s Guideline No. 070 is a very good starting point. To distill it down to a bare minimum, the following concepts should be known to all purchasers and package engineers:

  • Why does a safety integrity level requirement exist for the function your equipment is a part of?
  • What is a safety integrity level (SIL) in terms of:
    • Quantitative requirements (PFD quota for the equipment)
    • Architectural requirements (hardware fault tolerance, safe failure fraction, etc.)
    • Software requirements
    • Qualitative requirements
  • What are the basic documentation requirements?

When this is known, communication between purchaser and supplier becomes much easier. It also becomes easier for the package engineer and the purchaser to discuss follow-up of the vendors and what requirements should be put in the purchase order, as well as in the request for proposal. Most projects will develop a lot of functional safety documents. Two of the most important ones in the purchasing process are:

  • Safety Requirement Specification (SRS): In this document you find a description of the function your component is a part of, and the SIL requirements to the function. You will also find allocated PFD quotas to each component in the function – this is an important number to use in the purchasing process.
  • A “Vendor guideline for Safety Analysis Reports” or a “Safety Manual Guideline” describing the project’s documentation requirements for SIL rated equipment

So, what can you do to bring things into this nice and orderly state? If you are a purchaser, take a brief SIL primer, or preferably, ask your company’s functional safety person to give you a quick introduction. Then talk to your package engineer about these things when setting out the ITT. If you are a package engineer, invite your purchaser for a coffee, to discuss the needs of the project in terms of these things. If the purchaser does not understand the terminology, be patient and explain. And remember that not everybody has the right background; the engineer may fail to understand some technical details of the purchasing function, and the purchaser may not understand the inner workings of your compressor – but aiming for a common platform to discuss requirements and follow-up of vendors will make life easier for both of you.

Do you calculate failure probabilities during process design?

Process design often follows this pattern in practice:

  1. Draw up P&IDs and add instrumentation, safety functions and alarms
  2. Perform HAZOP
  3. Change P&IDs
  4. Perform SIL allocation study
  5. Wait….
  6. Calculate probabilities for failure on demand for safety functions with SIL requirements
  7. Realize this is not going to work
  8. Back to 3

Instead of doing this – which is very expensive – we should calculate the probability of failure on demand while designing the safety functions. This can be done in a number of ways, ranging from relatively coarse and simple to very involved and complex, like Petri nets and Monte Carlo simulations. For design evaluations, simple methods are usually good enough. The simplest of all may be the interpolation of pre-calculated results. Say someone compares a lot of architectures and failure rates and makes a gigantic table of PFD results for you – then you can just look it up. The good news is – somebody already did. You can find such tables in IEC 61508-6, Annex B. The lookup can of course be done in a spreadsheet, like in the example below – no fancy software needed, in other words.

Say you have a safety function with a sensor element with $\lambda_{DU} = 4 \times 10^{-7}$, a logic unit with $\lambda_{DU} = 2 \times 10^{-7}$ and final elements with $\lambda_{DU} = 2.6 \times 10^{-7}$. You need to comply with a SIL 3 requirement. Using the lookup tables, we then quickly estimate that the PFD is approximately $1.03 \times 10^{-3}$. This is quite close to SIL 3 performance (a PFD below $10^{-3}$), but since there is some uncertainty in play and we know the final element is usually the problem (it also has the highest failure rate) we opt for a 1oo2 configuration of the final element. Then we obtain $4.7 \times 10^{-4}$, which is well within a SIL 3 requirement. As a designer, you can do this type of estimate already at point 1 in the sequence above – and you will save yourself a lot of trouble, delays and costs due to changes later in your design project.
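To make the comparison against the SIL 3 requirement explicit, here is a minimal Python sketch that maps an average PFD to the low-demand SIL bands of IEC 61508. The function name `sil_from_pfd` is made up for this illustration, and it only checks the quantitative requirement, not the architectural, software or qualitative requirements.

```python
# Upper PFDavg limits (exclusive) for the low-demand SIL bands in IEC 61508,
# ordered from the strictest band to the most lenient one.
SIL_UPPER_LIMITS = [(4, 1e-4), (3, 1e-3), (2, 1e-2), (1, 1e-1)]


def sil_from_pfd(pfd_avg: float) -> int:
    """Return the SIL band for an average PFD; 0 means no SIL is achieved.

    Hypothetical helper for illustration. Values below 1e-5 are capped at SIL 4.
    """
    if pfd_avg <= 0:
        raise ValueError("PFD must be positive")
    for sil, upper_limit in SIL_UPPER_LIMITS:
        if pfd_avg < upper_limit:
            return sil
    return 0


print(sil_from_pfd(1.03e-3))  # estimate with a single final element
print(sil_from_pfd(4.7e-4))   # estimate with 1oo2 final elements
```

With the numbers from the example, the first estimate lands just inside the SIL 2 band, while the 1oo2 variant lands in the SIL 3 band.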

For a small fee of 20 USD you can download the spreadsheet used in this example, capable of performing many different PFD calculations for 1-year test intervals (including diagnostics, common cause failures and redundancies). Get it here: Spreadsheet for SIL Calculations.

When does production downtime equal lost production?

Running a large factory or an oil platform costs a lot of money. It costs a lot of money to build the thing, it costs a lot of money to run the thing. The only reason people build and run these things is that they get even more money in, than they have to spend on building and operating their assets. In this post we are going to look at the operating expenditure and loss related to downtime on oil production platforms. What is special about these platforms compared to other factories is that they are producing from a more or less fixed reservoir of resources, and when you have depleted the reservoir, the party is over. Production downtime is often included as a risk category in SIL allocation work – often hidden under a somewhat diffuse name such as “asset impact” or “financial loss”. An accident causes financial loss in many ways:

  • Lost market confidence
  • Lost contract opportunities
  • Direct production loss
  • Repair costs
  • Extra money needed for marketing and branding

In risk discussions we normally think about direct production loss and repair costs – the other things are not included. This is OK. We’ll put the repair costs aside for now, and look at three different views people tend to take when discussing production downtime.

  • Lost production is a complete loss of income
  • Lost production means simply that income will come at the end of the production horizon and there is no loss provided the oil price stays the same
  • Lost production will lead to a loss measured in present value depending on the oil price and the cost of capital

The first one is the easiest to use in assessments (3 days of downtime x 10 million dollars per day = 30 million dollars lost). The second one means that you do not care about downtime – and is not very realistic. Usually someone will convince the person claiming this that he or she is wrong. The last option is obviously the most “correct” but also the most difficult to use for anything in practice. Who can tell what the oil price is in 20 years? And in particular, which engineer can do this calculation in real time during a risk assessment discussion? These obvious difficulties often lead people to go with the “it’s all lost” option.

Let’s have a look at what this really means for two different scenarios. The assumptions we make are that one day of production is worth 10 million dollars and that all operating expenses are fixed. We compare a production horizon of 5 years with a production horizon of 20 years. So we need to compare the value of 1 day of production now, in 5 years and in 20 years in terms of present value. Since we do not know the cost of capital for this particular operator we will calculate with a discount factor similar to what could be obtained by not trying to recover the production but rather investing the same amount of money in company stock. For simplicity we assume a 7% discount rate (it’s an arbitrary choice but not orders of magnitude away from what can be expected). The present value of 1 day of production deferred by 5 and by 20 years is thus:

5 years deferred production: $10 \text{ million} / (1 + 0.07)^{5} = 7.1 \text{ million}$

20 years deferred production: $10 \text{ million} / (1 + 0.07)^{20} = 2.6 \text{ million}$

We see that by deferring production by 5 years at an opportunity cost of 7% per year we get a present value loss of 29%, whereas by deferring production for 20 years we lose 74%. For a 7% opportunity cost we can draw this graph for cost of deferred production in per cent of the current value:
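The numbers behind such a curve are easy to reproduce. The following Python sketch recomputes the present values and the percentage loss for a few deferral periods; the function names are chosen for this illustration, and the 7% rate is the same arbitrary assumption as above.

```python
def present_value(amount: float, years: float, rate: float = 0.07) -> float:
    """Discount an amount received `years` from now back to today."""
    return amount / (1 + rate) ** years


def deferral_loss_percent(years: float, rate: float = 0.07) -> float:
    """Share of today's value lost by deferring income by `years`, in per cent."""
    return 100 * (1 - 1 / (1 + rate) ** years)


# One day of production is assumed to be worth 10 million dollars.
for years in (5, 10, 15, 20):
    value = present_value(10.0, years)
    loss = deferral_loss_percent(years)
    print(f"{years:2d} years: {value:.1f} million, loss {loss:.0f}%")
```

Changing `rate` shows how sensitive the conclusion is to the assumed cost of capital.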

This should not be a topic for discussion during a risk assessment workshop but should be baked into the risk acceptance criteria. For projects with long production horizons (say 15-20 years), a day of downtime may conservatively be assumed to be lost. For shorter durations the asset owner should define acceptable frequency of downtime, e.g.

1-3 days downtime: 0-10 years between

4-10 days downtime: 10 – 30 years between

..

This type of acceptance definition is more practical to use than a dollar value.

Uncertainty and effect of proof test intervals on failure probabilities of critical safety functions

When we want to make sure that the integrity of our safety function is good enough, we use a calculation of probability of failure on demand (PFD) to check against the required reliability level. The requirement comes from a risk analysis compared against risk acceptance criteria. But what are we actually calculating, and how do uncertainties and the selection of proof test intervals change the results?

Most probability calculations are done for the time-averaged value of the probability of failure on demand. Normally we assume that a proof test will discover all failure modes; that is, we are assuming that the test coverage of our proof test is 100%. This may be unrealistic, but for the time being, let us just assume that this is correct. The average PFD for a single component can be calculated as

$PFD_{avg} = \lambda_{DU} \tau / 2$,

where $\lambda_{DU}$ is the failure rate per hour and $\tau$ is the proof test interval in hours. Let us now consider what the instantaneous probability of failure on demand is; this value grows with time after a proof test, where it is assumed that the PFD is zero at time zero for 100% proof coverage. The standard model for component reliability follows an exponential distribution. The instantaneous PFD is then given by the cumulative distribution function of the exponential distribution:

$PFD(t) = 1 - e^{-\lambda t}$.

Effect of proof test interval and the time-variation of the PFD value

The instantaneous probability of a failure on demand can thus be plotted as a function of time. With no testing the failure probability approaches one as t → ∞. With the assumption of 100% proof test coverage, we “reset” the PFD to zero after each test. This gives a “sawtooth” graph. Let us plot the effect of proof testing, and see how the average is basically the probability “in the middle” of the saw-tooth graph. This means that towards the end of your test interval the probability of a failure is almost twice the average value, and in the beginning it is more or less zero.

In this example the failure rate is $10^{-5}$ failures per hour and the proof test interval is 23000 hours (a little more than two and a half years). By increasing the frequency of testing you can thus lower your average failure probability, but in practice you may also introduce new errors. Remember that about half of all accidents are down to human errors – several of those during maintenance and testing!
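The saw-tooth behaviour can be sketched in a few lines of Python. This is an illustration under the same assumptions as the text (exponential model, 100% proof test coverage, instantaneous tests); the function names are invented here. It also compares the simple average formula with the exact time average of the exponential expression.

```python
import math

FAILURE_RATE = 1e-5       # failures per hour, from the example
TEST_INTERVAL = 23000.0   # proof test interval in hours, from the example


def pfd_instant(t: float, rate: float = FAILURE_RATE, tau: float = TEST_INTERVAL) -> float:
    """Instantaneous PFD at time t, reset to zero at every proof test."""
    time_since_test = t % tau
    return 1 - math.exp(-rate * time_since_test)


def pfd_avg_approx(rate: float, tau: float) -> float:
    """Simplified average, lambda * tau / 2."""
    return rate * tau / 2


def pfd_avg_exact(rate: float, tau: float) -> float:
    """Exact time average of 1 - exp(-lambda * t) over one test interval."""
    x = rate * tau
    return 1 - (1 - math.exp(-x)) / x


print(f"approximate average: {pfd_avg_approx(FAILURE_RATE, TEST_INTERVAL):.3f}")
print(f"exact average:       {pfd_avg_exact(FAILURE_RATE, TEST_INTERVAL):.3f}")
print(f"just before a test:  {pfd_instant(TEST_INTERVAL - 1):.3f}")
```

For a product of failure rate and test interval as large as in this example, the simplified formula gives a slightly higher value than the exact average, so it errs on the conservative side.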

Effect of uncertainty in failure rate data on calculated PFD values

Now to the second question – what about uncertainty in data? For a single component the effect is rather predictable. Let us use the same example as above but we want to investigate what the effect of uncertainty on λ is. Let us say we know the failure rate is between 0.70 and 1.3 times the assumed value of $10^{-5}$. Calculating the same saw-tooth function then gives us this picture:

We can see that the difference is quite remarkable, just for a single component. Getting good data is thus very important for a PFD calculation to be meaningful. The average value for the low PFD estimate is 0.08, whereas for the high PFD estimate it is 0.15 – almost twice as high!
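The low and high averages quoted above follow directly from the simplified average formula. A small Python sketch, with invented helper names and the 0.70 and 1.3 factors from the text as defaults:

```python
def pfd_avg(rate: float, tau: float) -> float:
    """Simplified average PFD for a single component, lambda * tau / 2."""
    return rate * tau / 2


def pfd_bounds(nominal_rate: float, tau: float,
               low_factor: float = 0.7, high_factor: float = 1.3):
    """Return (low, nominal, high) average PFD for an uncertain failure rate."""
    return (
        pfd_avg(low_factor * nominal_rate, tau),
        pfd_avg(nominal_rate, tau),
        pfd_avg(high_factor * nominal_rate, tau),
    )


low, nominal, high = pfd_bounds(1e-5, 23000.0)
print(f"low {low:.2f}, nominal {nominal:.3f}, high {high:.2f}")
```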

Let us now consider what happens to the uncertainty when we combine two components in series as in this reliability block diagram:

These two components are assumed to be of identical failure rates and with the same uncertainty in the failure rate as above. If both have failure rates at the lower end of the spectrum, we get an overall PFD of PFD1+PFD2 = 0.16. If, on the other hand, we look at the most conservative result, we end up with 0.30. The uncertainties add up with more components – hence, using optimistic data may cause non-conservative results for serial connections with many components. Now if we turn to redundant structures, how do uncertainties combine? Let us consider a 1oo2 voting of two identical components.

The PFD for this configuration may be written as follows:

In this expression, which is taken from the PDS method handbook, the first part describes the PFD contribution from a common cause failure in both components (such as defects from production, clogged measurement lines on same mount point, etc.), whereas the second part describes simultaneous but independent failures in both components. The low failure rate value gives a PFD = 0.10, the high failure rate gives PFD = 0.20 and the average becomes 0.15. In both cases the relative uncertainty in the PFD is the same as the relative uncertainty in the λ value – this is because the calculations only involve linear combinations of the failure rate – for more complex structures the uncertainty propagation will be different.

What to do about data

This shows that the confidence you put in your probability calculations comes down to the confidence you can have in your data. Therefore, one should not believe reliability data claims without good backup on where the numbers are coming from. If you do not have very reliable data, it is a good idea to perform sensitivity analysis to check the effect on the overall system.
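A very simple sensitivity analysis for components in series could look like the Python sketch below. It scales all failure rates by a common factor and sums the component PFDs, as in the two-component example earlier. The function names and the choice of a common scaling factor are assumptions made for this illustration, and the plain sum is only a rough approximation when the PFD values are as large as here.

```python
def series_pfd_avg(failure_rates, test_interval_hours):
    """Approximate average PFD of components in series as the sum of lambda * tau / 2."""
    return sum(rate * test_interval_hours / 2 for rate in failure_rates)


def sensitivity(failure_rates, test_interval_hours, factors=(0.7, 1.0, 1.3)):
    """Average PFD of the series system for each scaling factor on the failure rates."""
    return {
        factor: series_pfd_avg([factor * rate for rate in failure_rates],
                               test_interval_hours)
        for factor in factors
    }


# Two identical components as in the series example above.
result = sensitivity([1e-5, 1e-5], 23000.0)
for factor, pfd in result.items():
    print(f"failure rates x {factor:.1f}: PFDavg = {pfd:.3f}")
```

If the optimistic and pessimistic results fall in different SIL bands, the data quality deserves a closer look before the design is accepted.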

 

Progressing from design to operation with a SIL rated system

Many companies operating industrial production systems have learned how to use risk assessments and safety integrity levels during design and process development. Many have, however, asked how to actually work with this in operations to make sure the safety functions provide the level of safety we need. Maintaining the safety integrity level throughout the operational part of the asset’s lifecycle can actually be very demanding, and it requires a holistic view of asset management considering many aspects. A good asset management program needs to make sure design requirements are fulfilled; it needs to have provisions for monitoring the state of the asset for damage or degradation such as corrosion, instrument drift or material defects. It must also prioritize such that maintenance is effective and does not drive costs in unhealthy ways. Asset management and barrier integrity management is thus no easy task.

When taking a system from design to operation we are equipped with theoretical foundations and plans for how to use the asset. We do not have operational experience, and we do not know how the asset actually will perform in practice. We need to take what we have learned during engineering and transform this into a system for managing our assets in a way that includes barrier integrity, and that takes the requirements and limitations of SIL rated systems into account. Necessary functions and considerations for establishing a good barrier management system are shown in the figure below. You should include planning for operations already when establishing the functional safety management plan in the design phase.

We need to take with us the safety requirements from engineering into the barrier management system. For your safety instrumented system this would consist of information found in the Safety Requirement Specification (SRS) and the risk assessments used to establish the SRS. The reason for the latter is that we need to make sure that the assumptions about other independent protection layers are not violated, or that protection layers do not disappear. Further, your company needs to have performance standards for different systems – these should also be integrated into your barrier management system. Finally, from a practical and economical point of view, you need to take your maintenance and spare parts philosophy to the next level by implementing the necessary maintenance activities for barrier elements in your barrier management system.

Monitoring for safety is very important if you want your risk management system to work. For SIL rated systems there are many sources of performance data. These should at least include results from proof testing, from diagnostics and automated monitoring systems, and from maintenance focused inspections. All of these data should be analyzed using suitable tools, and the results of this analysis should be taken into your overall barrier management data storage or data warehouse. Based on the data gathered and the state of the barrier system, you need to devise actions and make sure they are done in due time to avoid deterioration of the system.