What does the IEC 61508 requirement to have a safety management system mean for vendors?

All companies involved in the safety lifecycle are required to have a safety management system, according to IEC 61508. What the safety management process entails for a specific project is relatively clear from the standard, and is typically described in an overall functional safety management plan. It is, however, much less clear from the standard what is expected of a vendor producing a component that is used in a SIS, but that is a generic product rather than a system designed for one particular situation.

For vendors, the safety management system should be extensive enough to support fulfillment of all four aspects of the SIL requirement the component is targeting:

  • Quantitative requirements (PFD/PFH)
  • Semi-quantitative and architectural requirements (HWFT, SFF, etc.)
  • Software requirements
  • Qualitative requirements (quality system, avoidance of systematic failures)

A great safety management system is tailored to maintain the safety integrity level capability of the product from all four perspectives. Maintaining this integrity requires a high-reliability organization, as well as knowledgeable individuals.

Quite often, system integrators and system owners experience challenges working with vendors. We’ve discussed this in previous posts, e.g. follow-up of vendors. Based on experience from several sides of the table, the following parts of a safety management system have proven essential:

  • A good system for receiving feedback and using experience data to improve the product
  • Clear role descriptions, competence requirements and a training system to make sure all employees are qualified for their roles
  • A good change management system, ensuring the impact of changes is assessed from several angles
  • A quality system that ensures continuous improvement can occur, and that such processes are documented
  • A documentation system that ensures the capabilities of the product can be documented in a trusted way, taking all changes into account in a transparent manner

A vendor that has such systems in place will have a much greater chance of delivering top quality products than a vendor that only focuses on the technology itself. Ultra-reliable products require great organizations to stay ultra-reliable throughout the entire lifecycle.

Would you operate a hydraulic machine with no cover or emergency stop?

Sounds like a crazy idea, right? Research, however, has shown that about half of professional machine operators do not think safety functions are necessary. You know, things like panel limit switches, torque limiters and emergency stop buttons. Who would need that, right? This number comes from a report issued in Germany about 10 years ago, and I am not very optimistic that the numbers have improved much since then. The report can be found here: http://www.dguv.de/ifa/Publikationen/Reports-Download/BGIA-Reports-2005-bis-2006/Report-Manipulation-von-Schutzeinrichtungen/index.jsp (in German).

Machine safety is important for both users and bystanders. Manipulations of safety functions are common – and the risk increase is typically unknown to users and others. How can we avoid putting our people at risk due to degraded safety of our machinery?

Researchers have found that the safety functions of machines are frequently manipulated. This is typically done because workers perceive the manipulation as necessary to perform the work, or to improve productivity. Everyone from machine builders to purchasers to operators should take this into account to prevent accidents. Consider for example a limit switch. A machine built to conform to the machinery directive (with CE marking) has to satisfy safety standards. Perhaps a SIL 2 requirement has been assigned to the limit switch because operation without it is deemed dangerous and a 100-fold risk reduction is necessary for operation to be acceptable. This means that if the limit switch is put out of function, the risk of operation is 100 times higher than the designer intended!
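
To make the arithmetic concrete, here is a minimal sketch (mine, not from any standard; the accident frequency is an assumption) of what a risk reduction factor of 100 means once the safety function is bypassed:

```python
# Hypothetical illustration of a risk reduction factor (RRF) of 100,
# as assigned to the SIL 2 limit switch discussed above.

def residual_risk(unprotected_frequency: float, rrf: float) -> float:
    """Accident frequency per year with the safety function in place."""
    return unprotected_frequency / rrf

unprotected = 1e-2  # assumed accident frequency without the limit switch, per year
rrf = 100           # risk reduction required for acceptable operation

protected = residual_risk(unprotected, rrf)
print(f"With limit switch:  {protected:.0e} accidents/year")
print(f"Switch manipulated: {unprotected:.0e} accidents/year "
      f"({unprotected / protected:.0f} times higher)")
```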

What can we do about this? We need to design machines such that safety functions become part of the workflow – not an obstacle to it. If workers perceive nothing to gain from manipulating the machine, they are unlikely to do it. This boils down to something we are aware of, but not good enough at taking into account in design processes: usability testing is essential not only to make sure operators are happy with the ergonomics – it is also essential for the safety of the people using the machine!

Why functional safety audits are useful

Functional safety work usually involves a lot of people and multiple organizations. One key success factor for design and operation of safety instrumented systems is the competence of the people involved in the safety lifecycle. In practice, when activities have been omitted, or the quality of the work is not acceptable, this is discovered in the functional safety assessment towards the end of the project – or worse, it is not discovered at all. The result of poor-quality work is lower integrity of the SIS as a barrier, and thus higher risk to people, assets and the environment – without the asset owner being aware of it! Obviously a bad situation.

Knowing how to do your job is important in all phases of the safety lifecycle. Functional safety audits can be an important tool for verification – and for motivating organizations to maintain good competence management systems for all relevant roles.

Competence management is a key part of functional safety management. In spite of this, many companies have less than desirable track records in this field. This may be due to ignorance, or because some organizations view a “SIL” as a marketing designation rather than a real risk reduction measure. Either way, such a situation is unacceptable. One key tool for ensuring that everybody involved understands their responsibilities, and makes an effort to learn what they need to know to actually secure the necessary system-level integrity, is the use of functional safety audits. An auditing program should be in place in all functional safety projects, covering at least the following aspects:

  • A procedure for functional safety audits should exist
  • An internal auditing program should exist within each company involved in the safety lifecycle
  • Vendor auditing should be used to make sure suppliers are complying with functional safety requirements
  • All auditing programs should include aspects related to document control, management of change and competence management

Constructive auditing can be an invaluable part of building a positive organizational culture – one where quality becomes important to every function in the value chain, from the sales rep to the R&D engineer.

One day statements like “please take the chapter on competence out of the management plan, we don’t want any difficult questions about systems we do not have” may seem like an impossible absurdity.

Stages of process safety understanding

Defining process safety should be quite straightforward. However, what people mean by this term varies quite a lot, and what they include in it depends largely on their understanding of the anatomy of severe accidents. Personally, I have encountered the following understandings of the topic:

  • Process safety is what is governed by API 521 (basically steel strength and dimensioning of pressure relief valves)
  • Process safety is the technical measures taken to stop an accident from occurring
  • Process safety is the sum of organizational and technical systems involved in mitigating risk of major accidents

The first statement is obviously too narrow – especially as we know that more than half of accidents are down to human factors! Definition number 2 is a traditional view, and slightly more mature as it includes both the safety instrumented system and alarm management (to a certain extent). The last definition is maybe the most “modern”, and includes organizational culture, safety leadership as well as the technologies included in the first and second definitions.

How people understand the term “process safety” tends to mature over time – from a strictly technical view to a more holistic view that includes both individual and organizational factors, as well as the technologies and how they are used in a system. The walk up this staircase from the technology-focused to the more holistic view can take a long time, but conscious reflection can help speed up the journey to improved performance and risk management.

A complete understanding of barrier systems – which is really what risk management is about – requires an understanding of which factors influence accident risk, and what can be done to mitigate that risk. This requires that the asset owner thinks not only about “proof testing”, “compliance” or “asset management”, but also about:

  • Leadership
  • Barrier integrity
  • Maintenance
  • Monitoring
  • Design
  • Competence management
  • Permit to work system
  • Dynamics of plant and controls in normal and degraded modes
  • Etc, etc, etc.

In other words – to keep risk under control you need to take the full complexity of your operations into account. A purely technical view on process safety is thus simply not good enough.

Your safety and human factors

Today I almost lost a wheel on my car while driving. How could this happen? This morning I went to the dealership to change tires on my car. I had to wait for the mechanic to come in to work in the morning because I was a bit early, and I didn’t really have an appointment, I just showed up. After a while the mechanic arrived and after some 15 minutes they gave me my keys back and told me “you’re good to go”. Happily I drove down to the office with new tires on the car.

A little later today I was going to see one of our clients, and I started driving. After a few kilometers I started to hear a banging noise from the back of my car. This had happened to me once before, several years back, so I immediately understood that one of the wheels had not been properly fastened and that I should stop right away. I pulled over and walked around the car – and yes, the bolts were loose – in fact, they were almost falling out by themselves! Obviously this was very dangerous, and I called the dealership and told them to come and finish the job on the roadside. They showed up after just a few minutes, and the guy who came over apologized and tightened all the bolts. I suggested to him that they should consider having a second person check the mechanic’s work before handing the vehicle back to the owner. He agreed – and told me it was “the new guy” who’d changed the tires in the morning.

This came down to several factors: a new guy, maybe with insufficient training, and a client putting pressure on him to finish the job (I was inside the dealership drinking coffee and didn’t talk to the mechanic, but he may have interpreted my presence that way). Simple things affect our performance and can have grave consequences. Why did it not go wrong this time? Because I recognized the sound and stopped in time, before the wheel fell off – luck, in addition to experience with the same kind of problem. Humans can thus be both the initiating causes of accidents and barriers against them.

A quest for knowledge – and the usefulness of your HR department in functional safety management

Most firms claim that their people are their most important asset. Whether this has any effect on operations is another matter – some actually mean it, while others seem to do little to keep their people well-equipped for the tasks they need to do.

When it comes to functional safety, competence management is a very important part of the game. In many projects, one of the major challenges is getting the right information and documentation from suppliers. Why is this so difficult? It comes down to work processes, communication and knowledge, as discussed in a previous post. One requirement common to IEC 61508 and IEC 61511 is that every role involved in the safety lifecycle should be competent to perform its function. In practice, this is only fulfilled if each of these roles has clear competence requirements and expectations, a description of how competence will be assessed, and a plan for how knowledge will be created for the role.

There are many ways of training your people, and this is a huge part of the field of HR. Most likely, people in your company’s HR functions actually know a great deal about planning, organizing and executing competence development programs. Involving them in your functional safety management planning can thus be a good idea! A few key issues to think about:

  • What are the requirements for your key roles?
  • What are your key roles (package engineer, procurement specialist(!), instrument engineer, project manager, etc., etc.)?
  • How do you check whether they have the right competence (peer assessment, tests, interviews, experience, etc.)?
  • What training resources do you have available (courses, e-learning, on-the-job training, self-study, etc.)?
  • How often do you need to reassess competence?
  • Who is responsible for this system (HR, project manager, functional safety engineer, etc.)?

A company that has this firmly in place will most likely be able to steer its supply chain and help suppliers gain confidence and knowledge – vastly improving communication across interfaces, and thereby also the quality of cross-organizational work.

How independent should your FSA leader be?

Functional safety assessment is a mandatory third-party review/audit of functional safety work, and is required by most reliability standards. In line with good auditing practice, the FSA leader should be independent of the project development. What exactly does this mean? Practice varies from company to company, from sector to sector and even from project to project. It seems reasonable to require a greater degree of independence for projects where the risks managed by the SIS are more serious. IEC 61511 requires (Clause 5.2.6.1.2) that functional safety assessments are conducted with “at least one senior competent person not involved in the project design team”. In a note to this clause the standard remarks that the planner should consider the independence of the assessment team (among other things). This is hardly conclusive.

If we go to the mother standard IEC 61508, requirements are slightly more explicit, as given by Clause 8.2.15 of IEC 61508-1:2010, which states that the level of independence shall be linked to perceived consequence class and required SILs. For major accident hazards, two categories are used in IEC 61508:

  • Class C: death to several people
  • Class D: very many people killed

For class C the standard accepts the use of an FSA team from an “independent department”, whereas for class D only an “independent organization” is acceptable. Furthermore, also for class C, an independent organization should be used if the degree of complexity is high, the design is novel, or the design organization lacks experience with this particular type of design. There are also requirements based on systematic capability in terms of SIL, but in the context of industrial processes these are normally less stringent than the consequence-based requirements for FSA team independence. The standard also specifies that compliance with a sector-specific standard, such as IEC 61511, can make a different basis for considering independence acceptable.
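
The decision logic can be summarized in a few lines. The following sketch is my reading of the clauses referenced above, not a quotation of the standard, and covers only the two major accident classes discussed here:

```python
# Minimum FSA team independence per IEC 61508 consequence class,
# as described in the text above (classes C and D only).

def minimum_fsa_independence(consequence_class: str,
                             complex_or_novel: bool = False) -> str:
    levels = {
        "C": "independent department",    # death to several people
        "D": "independent organization",  # very many people killed
    }
    level = levels[consequence_class]
    # For class C, high complexity, a novel design or lack of design
    # experience escalates the requirement to an independent organization.
    if consequence_class == "C" and complex_or_novel:
        level = "independent organization"
    return level

print(minimum_fsa_independence("C"))                         # independent department
print(minimum_fsa_independence("C", complex_or_novel=True))  # independent organization
print(minimum_fsa_independence("D"))                         # independent organization
```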

In this context, the definitions of “independent department” and “independent organization” are given in Part 4 of the standard. An independent department is separate and distinct from the departments responsible for activities that take place during the specified phase of the overall system or software lifecycle subject to the validation activity. This also means that the line managers of those departments should not be the same person. An independent organization is separated, by management and other resources, from the organizations responsible for activities that take place during the lifecycle phase. In practice, this means that the organization leading a HAZOP or LOPA should not perform the FSA for the same project if there are potential major accident hazards within the scope, and preferably not if there are any significant fatal accident risks in the project either. Considering the requirement of separate management and resource access, it is not a non-conformity if two different legal entities within the same corporate structure perform the different activities, provided they have separate budgets and leadership teams.

If we consider another sector-specific standard, EN 50129 for RAMS management in the European railway sector, we see that similar independence requirements exist for third-party validation activities. Figure 6 in that standard seemingly allows the assessor to be part of the same organization as one involved in SIS development, but for this situation it further requires that the assessor holds an authorization from the national safety authority, is completely independent from the project team, and reports directly to the safety authorities. In practice, the independent assessor in most cases comes from an independent organization.

It is thus highly recommended to have an FSA team from a separate organization for all major SIS developments intended to handle serious risks to personnel; this is in line with common auditing practice in other fields.

Why is this important? Because we are all human. If we feel ownership of a certain process or product, or affiliation with an organization, it will inevitably be more difficult for us to point out what is not so good. We do not want to hurt people we work with by stating that their work is not good enough – even if we know that inferior quality in a safety instrumented system may actually lead to workers getting killed later. If we look to another field with the same type of challenges but more guidance on independence, we can refer to the Sarbanes-Oxley Act of 2002 from the United States. The SEC has issued guidelines on auditor independence and what should be assessed. Specifically, they include:

  1. Will a relationship with the auditor create a mutual or conflicting interest with their audit client?
  2. Will the relationship place the auditor in the position of auditing his/her own work?
  3. Will the relationship result in the auditor acting as management or an employee of the audit client?
  4. Will the relationship result in the auditor being put in a position where he/she will act as an advocate for the audit client?

It would be prudent to consider at least these questions if considering using an organization that is already involved in the lifecycle phase subject to the FSA.

Electrical isolation of ignition sources on offshore installations

One of the typical major accident scenarios considered when building and operating an offshore drilling or production rig is a gas leak that ignites, leading to a jet fire or, even worse, an explosion. For this scenario to happen we need three things (the fire triangle):

  • Flammable material (the gas)
  • Oxygen (air)
  • An ignition source

The primary protection against such accidents is containment of flammable materials; avoiding leaks is the top safety priority offshore. As a lot of this equipment is located outdoors (or in “naturally ventilated areas”, as standards tend to call it), removing the “air” is not an option. Removing the ignition source in the event of a gas leak hence becomes very important. The technical system used to achieve this consists of a large number of gas detectors distributed across the installation, sending a signal on detected gas to a controller, which then sends a signal to shut down all potential ignition sources (i.e. equipment that is not Ex certified; see the ATEX directive for details).

This being the barrier between “not much happening” and “a major disaster”, the reliability of this ignition source isolation is very important. Ignition sources are normally electrical systems not specifically designed to avoid ignition (i.e. equipment that is not Ex certified). In order to achieve sufficient reliability of this set-up, the number of breakers needed for isolation should be kept to a minimum; this means that the non-Ex equipment should be grouped in distribution boards such that an incomer breaker can be used to isolate the whole group, instead of doing it at the individual consumer level. This is much more reliable, as the probability of failure on demand (PFD) contains an additive term for each of the breakers included:

PFD = PFD(Detector) + PFD(Logic) + Sum of PFD of each breaker

Consider a situation where you have 100 consumers, the dangerous undetected failure rate for the breakers used is 10^-7 failures per hour of operation, and proof testing is performed every 24 months. The contribution from a single breaker is then

PFD(Breaker) = 10^-7 × (8760 × 2) / 2 = 0.000876

If we then have 6 breakers that need to open for full isolation, the breaker contribution to the PFD is about 0.005 (which means that, with reliable gas detectors and logic solver, a full loop can satisfy a SIL 2 requirement). If we have 100 breakers, the breaker contribution is about 0.09 – and the best we can hope for is SIL 1.
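
The whole calculation fits in a few lines. Here is a minimal sketch reproducing the numbers above, using the standard low-demand approximation PFD ≈ λ_DU × τ / 2 (dangerous undetected failure rate times half the proof test interval):

```python
# Reproduces the breaker PFD arithmetic from the text above.

LAMBDA_DU = 1e-7            # dangerous undetected failures per hour
TEST_INTERVAL_H = 2 * 8760  # proof test every 24 months, in hours

pfd_single = LAMBDA_DU * TEST_INTERVAL_H / 2
print(f"Single breaker: {pfd_single:.6f}")  # 0.000876

for n in (6, 100):
    print(f"{n} breakers: {n * pfd_single:.4f}")
# 6 breakers   -> 0.0053: leaves room for SIL 2 (PFD < 0.01),
#                 given reliable gas detectors and logic solver
# 100 breakers -> 0.0876: already in the SIL 1 band (0.01 <= PFD < 0.1)
```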

What is the difference between software and hardware failures in a reliability context?

Reliability engineers have traditionally focused more on hardware than software. There are many reasons for this; one is that safety systems have traditionally been based on analog electronics, and although digital controls and PLCs were introduced throughout the 1990s, the software involved was initially very simple. Today the situation has really changed, but the reliability field has not completely taken this onboard. One of the reasons may be that reliability experts like to calculate probabilities – which they are very good at doing for hardware failures. Hardware failures tend to be random and can be modeled quite well using probabilistic tools. So – what about software? The failure mechanisms are very different: whereas failures in hardware are related to more or less stochastic effects stemming from load cycling, material defects and ageing, software defects are completely deterministic (we disregard stochastic algorithms here – they are banned from use in safety critical control systems anyway).

Software defects exist for two reasons: design errors (flaws) and implementation errors (bugs). These errors may be introduced at the requirements stage or during actual coding, but irrespective of when they are introduced, they are always static. They do not suddenly occur – they are latent errors hidden within the code that will activate each and every time the software state where the error is relevant is visited.
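
A hypothetical snippet (mine, not from the standards) illustrates how static such defects are: the flaw does nothing until the state that exposes it is visited, and then it fails every single time:

```python
# A latent, deterministic software defect: nobody considered span == 0
# at the requirements stage, so the flaw sits silently in the code.

def scale_reading(raw: int, span: int) -> float:
    return raw / span  # fails whenever span == 0, and only then

for span in (1024, 0):
    try:
        print(scale_reading(512, span))
    except ZeroDivisionError:
        print(f"span={span}: fails deterministically, on every visit to this state")
```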

Such errors are very difficult to include in a probabilistic model. That is why reliability standards prescribe a completely different medicine: a process-oriented framework that places requirements on management, choice of methods and tools, as well as testing and documentation. These quality-directed workflows and requirements are put in place so that we can have some confidence in the software not being a significant source of unsafe failures of the critical control system.

Hence – process verification and auditing take the place of probability calculations when we look at the software. In order to achieve the desired level of trust it is very important that these practices are not neglected in the functional safety work. Deterministic errors may be just as catastrophic as random ones – and therefore they must be managed with just as much rigor and care. The current trend is that more and more functionality is moved from hardware to software – which means that software errors are becoming increasingly important to manage correctly if we are not going to degrade both performance and trust of the safety instrumented systems we rely on to protect our lives, assets and the environment.

Does safety engineering require security engineering?

Safety critical control systems are developed with respect to reliability requirements, often following a reliability standard such as IEC 61508 or CENELEC EN 50128. These standards put requirements on development practices and activities with regard to creating software that works the way it is intended based on the expected input, and where availability and integrity are of paramount importance. However, these standards do not address information security. Some of the practices required by reliability standards do help in removing bugs and design flaws – which to a large extent also removes security vulnerabilities – but they do not explicitly express such concerns. Reliability engineering is about building trust in the intended functionality of the system. Security is about the absence of unintended functionality.

Consider a typical safety critical system installed in an industrial process, such as an overpressure protection system. Such a system may consist of a pressure transmitter, a logic unit (i.e. a computer) and some final elements. This simple system measures the pressure and transmits it to the computer, typically over a hardwired analog connection. The computer then decides whether the system is within a safe operating region or above a set point for stopping operation. If we are in the unsafe region, the computer tells the final element to trip the process, for example by flipping an electrical circuit breaker or closing a valve. Reliability standards that include software development requirements focus on how development must work in order to ensure that whenever the sensor transmits a pressure above the threshold, the computer will tell the process to stop. Furthermore, the computer is connected over a network to an engineering station, which is used for such things as updating the algorithm in the control system, changing the threshold limits, and so on.
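
The decision itself is trivial. Here is a minimal sketch (the names and setpoint are assumptions, not from the post) of what the logic unit does:

```python
# Core decision of a hypothetical overpressure protection function.

TRIP_SETPOINT_BARG = 50.0  # assumed high-pressure trip limit

def evaluate(pressure_barg: float) -> str:
    """Command sent to the final element for a given transmitted pressure."""
    if pressure_barg >= TRIP_SETPOINT_BARG:
        return "TRIP"  # open the breaker / close the valve
    return "RUN"       # within the safe operating region

print(evaluate(42.0))  # RUN
print(evaluate(55.5))  # TRIP
```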

What if someone wants to put the system out of order, without anyone noticing? The software’s access control would be a crucial barrier against anyone tampering with the functionality. Reliability standards do not say anything about how to actually avoid weak authentication schemes, although they talk about access management in general. You may very well be compliant with the reliability standard – yet have very weak protection against compromise of the access control. For example, the coder may use a call like getlogin() in C to identify the user in the authentication part of the software without violating any reliability standard requirements. This is a very insecure way of establishing user identity and should generally be avoided. If such a practice is used, a hacker with access to the network could with relative ease gain admin access to the system and change, for example, set points – or worse, recalibrate the pressure sensor to report wrong readings, something that was actually done in the Stuxnet case.
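
To illustrate the point in runnable form, here is a hedged sketch in Python rather than C (the credential and names are assumptions): trusting who the operating system says is logged in is no authentication at all, whereas requiring proof of a secret is a more defensible minimum:

```python
# Weak vs. stronger identity check for a hypothetical engineering station.

import hashlib
import hmac
import os

STORED_KEY = b"engineering-station-shared-secret"  # hypothetical credential

def weak_is_engineer() -> bool:
    # Equivalent in spirit to the getlogin()-style C call discussed above:
    # anyone who controls the session or environment controls this answer.
    return os.getlogin() == "engineer"

def stronger_is_engineer(presented_key: bytes) -> bool:
    # The caller must actually prove knowledge of the credential.
    expected = hmac.new(STORED_KEY, b"change-setpoint", hashlib.sha256).digest()
    presented = hmac.new(presented_key, b"change-setpoint", hashlib.sha256).digest()
    return hmac.compare_digest(expected, presented)

print(stronger_is_engineer(STORED_KEY))       # True
print(stronger_is_engineer(b"wrong-secret"))  # False
```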

In other words – as long as someone might be interested in harming your operation, your safety system needs security built in, and that does not come for free through reliability engineering. And there is always someone out to get you – for sport, for money, or just because they do not like you. Managing security is an important part of managing your business risk – so do not neglect this issue while worrying only about the reliability of intended functionality.