Do you calculate failure probabilities during process design?

Process design often follows this pattern in practice:

  1. Draw up P&IDs and add instrumentation, safety functions and alarms
  2. Perform HAZOP
  3. Change P&IDs
  4. Perform SIL allocation study
  5. Wait….
  6. Calculate probabilities for failure on demand for safety functions with SIL requirements
  7. Realize this is not going to work
  8. Back to 3

Instead of doing this – which is very expensive – we should calculate the probability of failure on demand while designing the safety functions. This can be done in a number of ways, ranging from relatively coarse and simple to very involved and complex, such as Petri nets and Monte Carlo simulations. For design evaluations, simple methods are usually good enough. The simplest of all may be interpolation of pre-calculated results: if someone were to compare a large number of architectures and failure rates and compile a gigantic table of PFD results for you, you could simply look up your case. The good news is that somebody already did – you can find such tables in IEC 61508-6, Appendix B. A plain spreadsheet is all you need for this, as in the example below – no fancy software required.

Say you have a safety function with a sensor element with λDU = 4 × 10⁻⁷, a logic unit with λDU = 2 × 10⁻⁷ and a final element with λDU = 2.6 × 10⁻⁷, and you need to comply with a SIL 3 requirement. Using the lookup tables, we quickly estimate that the PFD is approximately 1.03 × 10⁻³. This is quite close to SIL 3 performance, but since there is some uncertainty in play and we know the final element is usually the problem (it also has the highest failure rate), we opt for a 1oo2 configuration of the final element. We then obtain 4.7 × 10⁻⁴, which is well within the SIL 3 requirement. As a designer, you can do this type of estimate already at point 1 in the sequence above – and you will save yourself a lot of trouble, delays and costs due to changes later in your design project.
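The figures above come from the IEC 61508-6 lookup tables, which also account for diagnostics and common cause failures. If you prefer a small script to a spreadsheet, a minimal sketch of the same kind of design-stage estimate – using only the simplified PFDavg = λDU × τ / 2 formula discussed later in this post, an assumed one-year proof test interval and an assumed β-factor for the redundant pair – could look like the Python snippet below. The numbers will differ somewhat from the table lookup precisely because those extra effects are left out.

    # Rough design-stage PFD estimate (simplified formulas; assumed values).
    TAU = 8760.0  # assumed proof test interval: one year, in hours

    def pfd_1oo1(lam_du, tau=TAU):
        """Average PFD of a single (1oo1) element: lambda_DU * tau / 2."""
        return lam_du * tau / 2

    def pfd_1oo2(lam_du, tau=TAU, beta=0.05):
        """Simplified 1oo2 PFD: common cause term plus independent double failure.
        beta = 0.05 is an assumed common cause fraction, for illustration only."""
        return beta * lam_du * tau / 2 + ((1 - beta) * lam_du * tau) ** 2 / 3

    sensor, logic, final = 4e-7, 2e-7, 2.6e-7  # lambda_DU per hour, from the example

    pfd_single_final = pfd_1oo1(sensor) + pfd_1oo1(logic) + pfd_1oo1(final)
    pfd_redundant_final = pfd_1oo1(sensor) + pfd_1oo1(logic) + pfd_1oo2(final)

    print(f"PFD with single final element: {pfd_single_final:.2e}")
    print(f"PFD with 1oo2 final elements:  {pfd_redundant_final:.2e}")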

For a small fee of 20 USD you can download the spreadsheet used in this example, capable of performing many different PFD calculations for 1-year test intervals (including diagnostics, common cause failures and redundancies). Get it here: Spreadsheet for SIL Calculations.

When does production downtime equal lost production?

Running a large factory or an oil platform costs a lot of money. It costs a lot of money to build the thing, and it costs a lot of money to run the thing. The only reason people build and run these things is that they expect to get more money in than they spend on building and operating their assets. In this post we are going to look at the operating expenditure and loss related to downtime on oil production platforms. What is special about these platforms compared to other factories is that they produce from a more or less fixed reservoir of resources, and when the reservoir is depleted, the party is over. Production downtime is often included as a risk category in SIL allocation work – often hidden under a somewhat diffuse name such as “asset impact” or “financial loss”. An accident causes financial loss in many ways:

  • Lost market confidence
  • Lost contract opportunities
  • Direct production loss
  • Repair costs
  • Extra money needed for marketing and branding

In risk discussions we normally think about direct production loss and repair costs – the other items are not included, which is OK. We’ll put the repair costs aside for now and look at three different views people tend to take when discussing production downtime:

  • Lost production is a complete loss of income
  • Lost production means simply that income will come at the end of the production horizon and there is no loss provided the oil price stays the same
  • Lost production will lead to a loss measured in present value depending on the oil price and the cost of capital

The first one is the easiest to use in assessments (3 days of downtime × 10 million dollars per day = 30 million dollars lost). The second one means that you do not care about downtime at all – which is not very realistic, and usually someone will convince the person claiming this that he or she is wrong. The last option is obviously the most “correct”, but also the most difficult to use in practice. Who can tell what the oil price will be in 20 years? And in particular, which engineer can do this calculation in real time during a risk assessment discussion? These obvious difficulties often lead people to go with the “it’s all lost” option.

Let’s have a look at what this really means for two different scenarios. The assumptions we make are that one day of production is worth 10 million dollars and that all operating expenses are fixed. We compare a production horizon of 5 years with a production horizon of 20 years, so we need to compare the present value of 1 day of production now, in 5 years and in 20 years. Since we do not know the cost of capital for this particular operator, we will calculate with a discount factor similar to what could be obtained by not trying to recover the production but instead investing the same amount of money in company stock. For simplicity we assume a 7% discount rate (an arbitrary choice, but not orders of magnitude away from what can be expected). The present value of 1 day of production deferred by 5 or 20 years is thus:

5 years deferred production: 10 million / (1 + 0.07)⁵ = 7.1 million

20 years deferred production: 10 million / (1 + 0.07)²⁰ = 2.6 million

We see that by deferring production by 5 years at an opportunity cost of 7% per year, we take a present value loss of 29%, whereas deferring production by 20 years loses us 74%. For a 7% opportunity cost we can draw this graph of the cost of deferred production in per cent of the current value:
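The same arithmetic, extended to a few deferral times, can be sketched in a few lines of Python (using the assumed 10 million dollars per day and 7% discount rate from above):

    # Present value of one deferred production day at an assumed 7% discount rate.
    DAY_VALUE = 10.0   # million dollars per day of production (assumption from the text)
    RATE = 0.07        # assumed annual discount rate / opportunity cost

    def deferred_value(years, day_value=DAY_VALUE, rate=RATE):
        """Present value of one production day deferred by the given number of years."""
        return day_value / (1 + rate) ** years

    for years in (5, 10, 15, 20):
        pv = deferred_value(years)
        loss_pct = 100 * (1 - pv / DAY_VALUE)
        print(f"Deferred {years:>2} years: PV = {pv:4.1f} million, loss = {loss_pct:3.0f} %")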

This should not be a topic for discussion during a risk assessment workshop but should be baked into the risk acceptance criteria. For projects with long production horizons (say 15-20 years), a day of downtime may conservatively be assumed to be lost. For shorter horizons the asset owner should define acceptable frequencies of downtime, e.g.:

  • 1-3 days of downtime: 0-10 years between occurrences
  • 4-10 days of downtime: 10-30 years between occurrences
  • …

This type of acceptance definition is more practical to use than a dollar value.

Uncertainty and effect of proof test intervals on failure probabilities of critical safety functions

When we want to make sure that the integrity of our safety function is good enough, we use a calculation of the probability of failure on demand (PFD) to check against the required reliability level. The requirement comes from a risk analysis, evaluated against risk acceptance criteria. But what are we actually calculating, and how do uncertainties and the choice of proof test interval change the results?

Most probability calculations are done for the time-averaged value of the probability of failure on demand. Normally we assume that a proof test will discover all failure modes; that is, we are assuming that the test coverage of our proof test is 100%. This may be unrealistic, but for the time being, let us just assume that this is correct. The average PFD for a single component can be calculated as

PFDavg = λDU x τ / 2,

where λDU is the failure rate per hour and τ is the proof test interval in hours. Let us now consider the instantaneous probability of failure on demand; this value grows with time after a proof test, and with 100% proof test coverage the PFD is assumed to be zero right after each test. The standard model for component reliability is the exponential distribution, and its cumulative distribution function gives the instantaneous PFD:

PFD(t) = 1 – e^(–λt).

Effect of proof test interval and the time-variation of the PFD value

The instantaneous probability of a failure on demand can thus be plotted as a function of time. With no testing the failure probability approaches one as t → ∞. With the assumption of 100% proof test coverage, we “reset” the PFD to zero after each test. This gives a “sawtooth” graph. Let us plot the effect of proof testing, and see how the average is basically the probability “in the middle” of the saw-tooth graph. This means that towards the end of your test interval the probability of a failure is almost twice the average value, and in the beginning it is more or less zero.

In this example the failure rate is 10⁻⁵ failures per hour and the proof test interval is 23000 hours (a little more than two and a half years). By increasing the frequency of testing you can thus lower your average failure probability, but in practice you may also introduce new errors. Remember that about half of all accidents come down to human error – many of them made during maintenance and testing!
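A minimal sketch of this saw-tooth behavior, using the failure rate and test interval from the example (and assuming numpy is available), is shown below. Note that the simple λDU × τ / 2 figure sits slightly above the exact time average of the exponential curve.

    import numpy as np

    LAMBDA = 1e-5   # dangerous undetected failure rate per hour (from the example)
    TAU = 23000.0   # proof test interval in hours (from the example)

    t = np.linspace(0, 3 * TAU, 3000)        # three consecutive test intervals
    pfd_t = 1 - np.exp(-LAMBDA * (t % TAU))  # PFD "reset" to zero at every proof test

    print(f"Simple average lambda*tau/2  : {LAMBDA * TAU / 2:.3f}")
    print(f"Time average of PFD(t)       : {pfd_t.mean():.3f}")
    print(f"PFD just before a proof test : {1 - np.exp(-LAMBDA * TAU):.3f}")

    # To see the saw-tooth shape:
    # import matplotlib.pyplot as plt
    # plt.plot(t, pfd_t); plt.xlabel("hours"); plt.ylabel("PFD(t)"); plt.show()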

Effect of uncertainty in failure rate data on calculated PFD values

Now to the second question – what about uncertainty in the data? For a single component the effect is rather predictable. Let us use the same example as above, but now investigate the effect of uncertainty in λ. Say we know the failure rate is between 0.70 and 1.3 times the assumed value of 10⁻⁵. Calculating the same saw-tooth function then gives us this picture:

We can see that the difference is quite remarkable, just for a single component. Getting good data is thus very important for a PFD calculation to be meaningful. The average value for the low PFD estimate is 0.08, whereas for the high PFD estimate it is 0.15 – almost twice as high!
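These averages can be reproduced directly with the simplified PFDavg = λDU × τ / 2 formula, as in this small sketch:

    LAMBDA_NOMINAL = 1e-5   # assumed failure rate per hour
    TAU = 23000.0           # proof test interval in hours

    for label, factor in (("low", 0.70), ("nominal", 1.0), ("high", 1.3)):
        lam = factor * LAMBDA_NOMINAL
        print(f"{label:>7}: lambda = {lam:.2e}/h -> PFDavg = {lam * TAU / 2:.2f}")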

Let us now consider what happens to the uncertainty when we combine two components in series as in this reliability block diagram:

These two components are assumed to have identical failure rates, with the same uncertainty in the failure rate as above. If both have failure rates at the lower end of the spectrum, we get an overall PFD of PFD1 + PFD2 = 0.16. If, on the other hand, we look at the most conservative result, we end up with 0.30. The uncertainties add up with more components – hence, using optimistic data may give non-conservative results for serial connections with many components. Now, if we turn to redundant structures, how do the uncertainties combine? Let us consider a 1oo2 voting of two identical components.

The PFD for this configuration may be written as follows:

In this expression, which is taken from the PDS method handbook, the first part describes the PFD contribution from a common cause failure in both components (such as defects from production, clogged measurement lines on same mount point, etc.), whereas the second part describes simultaneous but independent failures in both components. The low failure rate value gives a PFD = 0.10, the high failure rate gives PFD = 0.20 and the average becomes 0.15. In both cases the relative uncertainty in the PFD is the same as the relative uncertainty in the λ value – this is because the calculations only involve linear combinations of the failure rate – for more complex structures the uncertainty propagation will be different.
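How these uncertainties propagate through the series and 1oo2 structures can be sketched as below. The exact 1oo2 expression and β-factor behind the figures above sit in the PDS handbook reference, so the sketch uses the common simplified 1oo2 formula with an assumed β of 0.1 purely to illustrate the propagation – its 1oo2 numbers will not reproduce the values quoted above.

    LAMBDA_NOMINAL = 1e-5   # assumed failure rate per hour
    TAU = 23000.0           # proof test interval in hours
    BETA = 0.1              # assumed common cause fraction, for illustration only

    def pfd_1oo1(lam, tau=TAU):
        """Average PFD of a single component: lambda * tau / 2."""
        return lam * tau / 2

    def pfd_series(lam, tau=TAU):
        """Two identical components in series: PFD1 + PFD2."""
        return 2 * pfd_1oo1(lam, tau)

    def pfd_1oo2(lam, tau=TAU, beta=BETA):
        """Simplified 1oo2 formula: common cause term plus independent double failure."""
        return beta * lam * tau / 2 + ((1 - beta) * lam * tau) ** 2 / 3

    for label, factor in (("low", 0.70), ("high", 1.3)):
        lam = factor * LAMBDA_NOMINAL
        print(f"{label:>4} lambda: series PFD = {pfd_series(lam):.2f}, 1oo2 PFD = {pfd_1oo2(lam):.3f}")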

What to do about data

This shows that the confidence you can put in your probability calculations comes down to the confidence you can have in your data. Therefore, one should not believe reliability data claims without good documentation of where the numbers come from. If you do not have very reliable data, it is a good idea to perform a sensitivity analysis to check the effect on the overall system.


Progressing from design to operation with a SIL rated system

Many companies operating industrial production systems have learned how to use risk assessments and safety integrity levels during design and process development. Many have, however, asked how to actually work with this in operations to make sure the safety functions provide the level of safety we need. Maintaining the safety integrity level throughout the operational part of the asset’s lifecycle can be very demanding, and it requires a holistic view of asset management covering many aspects. A good asset management program needs to make sure design requirements are fulfilled; it needs to have provisions for monitoring the state of the asset for damage or degradation such as corrosion, instrument drift or material defects. It must also prioritize such that maintenance is effective and does not drive costs in unhealthy ways. Asset management and barrier integrity management is thus no easy task.

When taking a system from design to operation we are equipped with theoretical foundations and plans for how to use the asset. We do not have operational experience, and we do not know how the asset will actually perform in practice. We need to take what we have learned during engineering and transform it into a system for managing our assets in a way that includes barrier integrity, and that puts the requirements and limitations of SIL rated systems into practice. Necessary functions and considerations for establishing a good barrier management system are shown in the figure below. You should include planning for operations already when establishing the functional safety management plan in the design phase.

We need to carry the safety requirements from engineering over into the barrier management system. For your safety instrumented system this consists of the information found in the Safety Requirement Specification (SRS) and the risk assessments used to establish the SRS. The reason for the latter is that we need to make sure that the assumptions about other independent protection layers are not violated, and that protection layers do not disappear. Further, your company needs to have performance standards for the different systems – these should also be integrated into your barrier management system. Finally, from a practical and economical point of view, you need to take your maintenance and spare parts philosophy to the next level by implementing the necessary maintenance activities for barrier elements in your barrier management system.

Monitoring for safety is very important if you want your risk management system to work. For SIL rated systems there are many sources of performance data. These should at least include results from proof testing, from diagnostics and automated monitoring systems, and from maintenance-focused inspections. All of these data should be analyzed using suitable tools, and the results of this analysis should be taken into your overall barrier management data storage or data warehouse. Based on the data gathered and the state of the barrier system, you need to devise actions and make sure they are carried out in due time to avoid deterioration of the system.

The problem is not with your brain, Sir, it is the number of limbs

Functional safety design is a concurrent activity with process design. In a process plant, the process engineer will specify the safety functions for a given unit. Take for example a pressure vessel with several inlets, such as a typical second-stage separator in an oil production train. This is a pressure vessel separating oil and gas by gravity, and typically the number of inlets is quite high. This is because the main production stream enters this separator together with recovered oil from multiple units downstream in the separation train, for example compression train scrubbers, an electrostatic coalescer, and the reclaimed oil sump from the flare knock-out drum. A simplified drawing of this may look as follows:

This pressure vessel is equipped with an automatic safety trip – a level alarm high-high (LAHH). When the high-high level is reached, the process shutdown system (PSD) will automatically stop inflow to the tank to prevent the level from increasing any further. Let us say this function has to satisfy a SIL 2 requirement, and that we have the following data available for the process shutdown valves (shown on the drawing), for the PSD logic node including I/O cards, and for the level transmitter:

Failure rates (dangerous undetected failures per million hours):

  • Valve with actuator: 0.5
  • Solenoid: 0.3
  • Safety PLC with I/O cards: 0.2
  • Level transmitter: 0.3

(These data are made up for this example – use real data in real applications)

The formula for calculating the average probability of failure on demand for a function with no redundancy is PFD = λDU × τ / 2, where λDU is the failure rate for dangerous undetected failures per hour, and τ is the proof test interval, i.e. the time between each full test of the function. If we assume that we test this function once per year (τ = 8760 hours), we can calculate the overall PFD:

PFD = [PFDValve + PFDSolenoid] × NValves + PFDPLC + PFDTransmitter

If we calculate this with the above data and 5 critical valves we get a PFD of 0.02. For a SIL 2 function we have to get below 0.01 – so this function is not reliable enough. Which options do we have, and what would be the best way to deal with this?

  1. Buy more reliable components?
  2. Redesign the process?
  3. Introduce other risk reduction systems to reduce the SIL requirement?

All of these changes could be useful – alone or in combination. However, it is a general problem that we end up with too many final elements in safety trips – and valves are typically much less reliable than electrical components. Therefore, it makes sense to reduce the number of valves. Actual valves may be more reliable than in this example – but they may also be less reliable, so considering reliability requirements when buying the valves is essential. In our case, we can look more closely at the system we are trying to protect:

  • Is the MEG injection really a big contributor? Maybe this line is normally not used, and the line size is very small? In that case we may choose not to include that valve in the SIL loop – although we will actually close it. But it is not critical to the safety of the system.
  • Can we combine some of the feed flows into a header – and locate one shutdown valve on that header instead of having individual valves? All flows are carrying oil – there is no reason to expect chemical incompatibilities. Let us say we confer with the process engineers and they agree on this.

We then have a changed process (we used option 2 – redesigning the process).

With this changed design – what is the PFD now? We can recalculate with only 2 critical valves and end up with PFD = 0.009. This is below 0.01 and is acceptable for a SIL 2 application.
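As a cross-check of the two results, the calculation can be sketched in a few lines of Python (failure rates from the table above, one-year proof test interval):

    TAU = 8760.0  # proof test interval in hours (one year, as assumed in the text)

    # Dangerous undetected failure rates per hour (the table gives them per 10^6 hours).
    LAMBDA_DU = {
        "valve_with_actuator": 0.5e-6,
        "solenoid": 0.3e-6,
        "safety_plc": 0.2e-6,
        "level_transmitter": 0.3e-6,
    }

    def pfd(lam_du, tau=TAU):
        """Average PFD of a single element with no redundancy: lambda_DU * tau / 2."""
        return lam_du * tau / 2

    def pfd_function(n_valves):
        """PFD = (PFD_valve + PFD_solenoid) * N_valves + PFD_PLC + PFD_transmitter."""
        per_valve = pfd(LAMBDA_DU["valve_with_actuator"]) + pfd(LAMBDA_DU["solenoid"])
        return per_valve * n_valves + pfd(LAMBDA_DU["safety_plc"]) + pfd(LAMBDA_DU["level_transmitter"])

    print(f"Original design, 5 valves: PFD = {pfd_function(5):.3f}")  # ~0.02  -> not good enough for SIL 2
    print(f"Changed design,  2 valves: PFD = {pfd_function(2):.3f}")  # ~0.009 -> within the SIL 2 band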

Points to remember

  • Be careful to avoid too many final elements in a safety function
  • Always make sure you buy equipment with the required reliability and sufficient documentation
  • When you need to change something – consider several options to avoid sub-optimizing your design

Treat your people well if you want your safety systems to work

Reliability standards all state requirements for planning and managing the functional safety work throughout the lifecycle. There are good and bad ways of doing this, and there is room for interpretation of the requirements in the standard. I’ve previously suggested 4 golden practices for good functional safety management – based on experience with what does not work in complex project organizations in different real-world offshore projects. This time we turn to another important aspect of maintaining focus, drive and high quality in all types of projects: how to plan and run a project with the well-being of the project members and stakeholders in mind.

“Making a small mistake due to stress and overload can have severe consequences when you are designing a barrier system.”

All resource-constrained projects run the risk of sliding into organizational overload. Overload is when individuals and organizations are fully saturated with workload – a situation where additional stress cannot be handled without decreasing performance. For an individual or organization in an overload state, dysfunctional behaviors and dramatically reduced productivity can be expected. Four important factors related to perceived stress levels are competence, comfort, confidence and control. If these “4 C’s” are disrupted, stress increases rapidly and the organization or individual may go into a state of overload. There are many contributors to perceived stress levels for individuals; some of them are baseline factors such as the type of organization and the organizational culture. Factors typically affecting individual stress levels within a project group are:

  • Scope creep without adequate resource allocation – leading to unrealistically high workloads
  • Need to apply technologies not understood by team members, and without appropriate training
  • Inadequate sponsor and management support
  • Lack of change management capabilities within the project team
  • Management pushing for high-risk projects without being willing to accept potential failures

So, how would an overload situation affect the outcomes of SIL work? Would the system be passed through to operations with insufficient quality, or would lower-quality products be caught in one or several verification activities and be rectified before the system is put into operation? Both scenarios are quite likely to occur. If overload occurs in early phases of the project, it is likely that errors will sneak into the requirement-setting phase. There are fewer formal checks on the requirements themselves than on the implementation. The most likely verification activity to catch an error from the requirement phase is the functional safety assessment (FSA). It is recommended to have one FSA just after the requirement phase, but many projects skip this and wait until further into the SIS development part of the lifecycle. Discovering errors in requirements late in the project will have significant schedule and cost impact. Worse yet, finding errors from the requirements phase is much more difficult than finding implementation errors later. In software development, it is well known that about 40-50% of bugs originate from errors in the specification. Without specific verification activities on the requirement setting, this is likely to be even worse. For an interesting discussion on software bugs and specifications, see this old blog post from Tyner Blain: http://tynerblain.com/blog/2006/01/22/where-bugs-come-from/.

One type of error would be to include protection layers that are not really dependable in the risk assessment, generating a lower SIL requirement than necessary. Another would be to overlook identified hazards, such that there are “holes in the safety net” and effectively zero integrity for that type of scenario. The consequences of organizational overload can thus be severe in a project related to functional safety.

Strategies to counter the risk of overload within a project are related to active management and control, both on the people level and in measures of progress and cost control. Practices that are known to reduce the likelihood of overload conditions on the individual or group level are:

  • Make few changes, and carefully manage changes when they are necessary
  • Check time constraints and resource allocation – management should not assign unrealistic time constraints to tasks. Most project management frameworks include the notion of “slack planning” to incorporate this task-related schedule risk
  • Keep sustaining sponsors actively involved. Lack of interest from senior sponsors is very visible to project members and may hurt the energy levels available to perform at the expected level
  • Improve change capacity by managing talent within the project actively; this includes training, offering career possibilities as the project nears delivery, as well as succession planning for key roles

From these bullet points, which are taken from PMI recommendations for handling overload risk, we conclude that taking care of your people is essential for good functional safety. Good competence management is key to achieving this; lack of confidence in one’s abilities is one of the most common stressors in projects where there is also a high degree of time pressure. Allocating resources such that there are not enough manhours available just adds fuel to the fire. I have previously referred to this as “stupid resource planning”, and unfortunately it is quite common, especially when there are schedule slips in projects.

To sum it up – make sure your people are not exposed to negative stressors over time. Maintain a positive attitude as a project manager – you are the leader of the people working on the project. And in that respect, the most important thing of all: “catch your people doing something right” – give praise whenever it is deserved. That takes the edge off people and motivates them – the best cure there is against negative stress.

Building barriers against major accidents

We all have someone we love – a life partner, kids, friends, family or even a dog. These are the most important parts of our lives – and we care deeply about the wellbeing of these special people (and animals). We trust employers to make workplaces safe, such that the ones most important to us can come back safely from work every day. Some workplaces have inherent dangers that expose people to unacceptable risks unless they are handled well. How do we manage the most severe accident risks, such as explosion risk on an offshore oil platform, nuclear accidents, or releases of toxic chemicals, as in the horrific 1984 Bhopal accident?

When we build and operate such plants we need to know what the hazards are, and we need to plan barriers to stop accident scenarios from developing. Risk management is thus integral to all sound engineering activity. A good description of a risk management process is given in ISO 31000 – such a process consists of several steps that should be familiar to practicing engineers and plant managers.

In the figure you can see this risk process explained. First of all, it is necessary to establish the context so that we can understand the impact of the risk – that is, we need to ask questions such as:

  • What is the business environment we are operating in?
  • Who will be present and exposed to the risk?
  • What type of training do these people have?
  • Where is the plant located?
  • Etc., etc.

Then, we work to identify risks. In a process plant this is typically done in a number of workshop meetings, such as design reviews and, maybe most importantly, the HAZOP (hazard and operability study). The risks identified are then analyzed, to see what the overall risk to the asset and the people operating it is. Based on the risk analysis, the risk is evaluated against the acceptance criteria: is the risk acceptable, or do we need to devise some scheme to lower it?

In most cases where major accident hazards are possible, some form of risk treatment is necessary. In fact, an overall principle for barriers against major accident hazards (MAHs) that is included in many jurisdictions’ legislation is:

“A single failure shall not lead directly to an unacceptable outcome.”

This leads us directly to our next natural line of thought; we need to build barriers into our process to stop accidents from happening, or to at least make sure an accident development path is changed to avoid unacceptable outcomes.

Common practice in process engineering is to require two barriers against accident scenarios, and these shall differ in working principle and be able to stop an accident from occurring independently of each other. In practice, one of these barriers would typically be a mechanical system not relying on electronics at all – such as a spring-loaded pressure relief valve. The other barrier is typically implemented in an automation system as a safety trip. It is to this latter barrier type, the Safety Instrumented Function (SIF), that we apply the concept of safety integrity levels (SIL) and the reliability standards IEC 61511 and IEC 61508.

Taking overpressure in a pressure vessel as an example, we can see how these barriers work to stop an accident from occurring. Assume a pressure vessel has a single feed coming from a higher-pressure source, with the pressure reduced before entry into the vessel by a pressure reduction valve (a choke valve). As long as the design pressure (the maximum allowable working pressure, MAWP) of the pressure vessel is below the pressure of the source, we have a potential for overpressurizing the tank. This is always dangerous – and particularly so if the contents are flammable (hydrocarbon gases, anyone?) or toxic (try googling methyl isocyanate). Clearly, in this situation, a single error in the choke valve can lead to a large release of dangerous material. Such errors may be due to material failure of the valve (e.g. fatigue), maloperation, or a control system error if the valve is an actuated valve used as a final element in a control system, for example for production rate control.

Process safety standards, such as ISO 10418 or API RP 14C, require such pressure vessels to be equipped with pressure safety valves that will relieve the pressure to a safe location when the design pressure is exceeded (typically the gas is burnt in a controlled flaring process). That is one barrier. Another barrier would be to install a pressure transmitter on the tank and a shutdown valve that will shut off the supply of gas from the pressure source. This valve and measurement should be connected to a control system that is independent of the normal process control system – to prevent a failure in the control system from also disabling the barrier function.

To sum it up: by systematically identifying risks and evaluating them against acceptance criteria, we have a good basis for introducing barriers. All accident scenarios should be controlled with at least two independent barriers, where one of them should be instrumented and the other preferably not. Instrumented functions should be in addition to the basic control system, to avoid common cause failures. The Safety Instrumented System (SIS) should be designed in accordance with applicable reliability standards to ensure sufficient integrity. Finally, the design must comply with local regulations and relevant industry practice and guidance – such as applicable international or local standards.

Contracts, interfaces and safety integrity

What do contract structures have to do with the safety of an industrial plant? A whole lot, actually. First, let us consider how contract structures regulate who does what on a large engineering and construction project. Normally, there will be an operator company that wants to build a new plant, be it a refinery, a chemical plant or an offshore oil platform. Such companies do not normally perform planning and construction themselves, nor do they plan what has to be done and separate it into many small work packages. They outsource the engineering, construction and installation to a large contractor – in the form of an EPC contract. The contractor is then responsible for planning, engineering and construction in accordance with the contract requirements. Such contract requirements will consist of many commercial and legal provisions, as well as a large range of technical regulations. On the technical side, the plant has to be engineered and built in accordance with applicable laws and regulations for the location where the plant is to be commissioned and used, as well as with company policies and standards as defined by the operating company.

What is the structure of the EPC contractor’s organization then, and how does this structure influence the safety of the final design? There is a lot of variation out there, but the following are common to all large projects:

  • A mix of employees and contractors working for the EPC company
  • Separation of engineering scope into EPC contractor scope and vendor scopes
  • Interface management is always a challenge

So – the situation we have is that long-term competence management is difficult due to a large number of contractors being involved. Communication is challenging due to many organizational interfaces. There is a significant risk of scope overlap or scope mismatch between vendor scopes. Finally, some interfaces will work well, and some will not.

Management of functional safety is a lifecycle activity that ties into many parts of the overall EPC scope. Hence, it is critical that everyone involved understands what his or her responsibilities are. Unfortunately, the competence level of the various players in this field is highly variable, and an overall competence management scheme is hard to implement. The closest tool available across company interfaces is the functional safety audit – a tool that seems to be largely underutilized.

Contracts tend to include functional safety requirements simply by reference to a standard. This may be sufficient where both parties fully comprehend what this means for the scope of work, but most likely there will be a need for clarification regarding the split of scope even in this case. In order to make interface management easier (or even feasible), the scope split should be included in the contract, as well as requirements for communication across interfaces and for role descriptions with proper competence requirements. This would then be easier to work with for the people involved, including HR, procurement, quality assurance, HSE and other management roles.

A quest for knowledge – and the usefulness of your HR department in functional safety management

Most firms claim that their people are their most important asset. Whether this has any effect on operations is another matter – some actually mean it, while others do not seem to do much to keep their people well-equipped for the tasks they need to do.


When it comes to functional safety, competence management is a very important part of the game. In many projects, one of the major challenges is getting the right information and documentation from suppliers. Why is this so difficult? It comes down to work processes, communication and knowledge, as discussed in a previous post. One requirement common to IEC 61508 and IEC 61511 is that every role involved in the safety lifecycle should be competent to perform his or her function. In practice, this is only fulfilled if each of these roles has a clear description of competence requirements and expectations, a description of how competence will be assessed, and a plan for how knowledge will be built for those roles.

There are many ways of training your people, and this is a huge part of the field of HR. Most likely, people in your company’s HR functions actually know a great deal about planning, organizing and executing competence development programs. Involving them in your functional safety management planning can thus be a good idea! A few key issues to think about:

  • What are the requirements for your key roles?
  • What are your key roles (package engineer, procurement specialist(!), instrument engineer, project manager, etc., etc.)?
  • How do you check if they have the right competence? (peer assessment, tests, interviews, experience, etc.)?
  • What training resources do you have available? (Courses, e-learning, on-the-job-training, self-study, etc.)?
  • How often do you need to reassess competence?
  • Who is responsible for this system? (HR, project manager, functional safety engineer, etc.)?

A company that has this firmly in place will most likely be able to steer its supply chain and help suppliers gain confidence and knowledge – vastly improving communication across interfaces, and thereby also the quality of cross-organizational work.

Taking the human factor into account when setting SIL requirements

A well-known fact from accident investigations is that the human factor plays a huge role. In many large accidents, the inquiry will mention organizational factors, leadership focus, procedures and training as important elements in a complex picture involving both human and technological factors. In the oil and gas industry it has been found that more than half of the gas leaks detected offshore come down to human factors and errors made during operation, maintenance or startup. On the other hand, humans may also play the role of the safeguard: an operator may choose to shut down a unit behaving suspiciously before any dangerous situation occurs, a vehicle driver may slow down to avoid relying heavily on the ABS system when braking on icy roads, an electrician may suggest replacing a discolored socket that is otherwise functioning well. All of these are human actions that lower the risk. The human thus always enters the risk picture and can both enhance and threaten the safety of an asset. This all depends on leadership, training, organizational maturity and attitudes. How do we deal with this in the context of safety integrity levels?

There are many practices. Thorough methodologies for analyzing human performance as part of barrier systems are available, such as human reliability analysis (HRA), developed first in the nuclear industry but now commonplace in many sectors (petroleum, chemical industry, aviation and transport). At the other end of the scale are the extremes of assuming “humans always fail to do the correct thing” or “humans always do the right thing”. When performing a SIL allocation analysis using typical methods such as layers of protection analysis (LOPA) or risk graphs (both described in IEC 61511), an important thing to consider is: can the bad outcome be avoided by human intervention? In many cases humans can intervene, and then we need to have a notion of how reliable the human is. Human performance is influenced by many factors, and these factors are analyzed in depth in the HRA framework. During a LOPA, a very detailed analysis of the human contribution is usually not within scope, and a simpler approach is taken. However, there are some important questions we can borrow from the HRA toolbox that will help us build more trust in the numbers we use in the LOPA, or in the credit we give this barrier element in the risk graph:

  • Is the operator well-trained and is the task easy to understand?
  • Does the operator have the necessary experience?
  • Does the organization have a positive safety culture?
  • Are there many tasks to handle at once and no clear priorities?
  • Is the situation stressful?
  • Does the operator have time to comprehend the situation, analyze the next action and execute before it is too late?

In many cases the operator will be well trained exactly for the accident scenarios in question. Also, if designed correctly, there will be clear alarm prioritization and helpful messages from the alarm system – but it is always good to challenge this, because the quality of alarm design varies a lot in practice. The situation is almost always stressful if the consequence of the accident is grave and there is some confusion about the situation, but training can do wonders here by turning the response into reflex operating steps – think of basic field-skills training in the military. The last question is always important – does the operator have enough time? What counts as enough time is hard to pin down: for simple situations 10-15 minutes may be sufficient, whereas for more complex situations a full hour may be needed for human intervention to be a trustworthy barrier element. Companies may have different guidelines regarding these factors – it should always be considered whether these guidelines are in line with current knowledge of human performance. Reaction times shorter than 15 minutes should not be allowed in the analysis if credit is given to the operator. For unusual scenarios, as is the case for “low-demand” safety functions, a PFD for the human intervention lower than 10% should not be used.

Giving credit to human intervention in SIL allocation is good practice – but the credit given should be realistic based on what we know about how humans react in these situations. Due to the large uncertainty, especially when performing a “quick-and-dirty” shortcut analysis such as discussed above, conservative values for human error should be assumed.

Also note that when a human action is included as an “independent protection layer” in a LOPA, the integrity of the entire barrier system includes this action as well. This means that in order to have control over barrier integrity, the company must carefully manage the underlying factors such as organizational maturity, safety leadership and competence management. Increased attention to these factors in internal hazard reviews could lead to improved safety performance; perhaps the number of accidents with human error as a root cause could be significantly reduced through more structured inclusion of human elements in barrier management thinking.