When «risk reduction» kills

School shootings are not all that uncommon. In America, that is. At the same time, lots of Americans believe owning a gun can improve their safety – they buy guns to shoot at bad guys. In fact, the likelihood that your weapon is going to kill you is larger than the probability that your gun is going to kill a terrorist, bank robber or rapist. By orders of magnitude. So, how many Americans have been killed by terrorists in the last 10 years? Almost none. How many have been killed in various shootings? A lot. There is a serious disconnect here. Vox.com has created a nice graphic to illustrate this (original here: http://www.vox.com/2015/10/1/9437187/obama-guns-terrorism-deaths).

What went wrong here? People are trying to protect themselves against a very low-frequency event by using a very dangerous tool. Human error is a dominant cause of failure in the operation of any technology, guns included. Combined with a lack of risk understanding and emergency response training, putting a gun in every paranoid ghost’s hand is a recipe for disaster. Using drastic means to curb risks can be necessary, but then the risk should warrant it. If we look at a risk matrix for any individual and compare a couple of situations, we see that some risk-controlling measures are best left unused.

Case 1: getting killed by terrorist. Probability: extremely small. Countermeasure: own gun and shoot terrorist. Likelihood of success of this mitigation attempt: extremely small.

Case 2: getting killed by gunman. Probability: relatively small. Countermeasure: reduce number of guns in society. Likelihood of success of this mitigation attempt: almost certain.

Then a single question remains: how is it possible that people do not make this connection and therefore block legislation that would reduce the number of guns around? People are prioritizing a very inefficient measure against an extremely unlikely event, instead of a very efficient measure against a far more likely event.

Get results by running fewer but better meetings

I’ve been to a lot of meetings – they are the battleground of modern business. They are also where we make decisions, drive progress and get our priorities aligned. Most meetings, however, are just terrible energy drains. Bad meetings are bad for people, and they harm quality. It is not hard to claim that bad meetings are also bad for safety, if the workshops and meetings used to drive risk assessments and engineering activities are not well organized with a clear focus. Based on experience from a decade in meeting rooms, I’ve devised the following 5 rules of great meetings that I think are truly helpful.

Meetings can vary a lot in format and location - selecting the architecture of your meeting carefully is one of the rules for driving great meeting experiences.

Meeting Rule #1 If your meeting does not have a clear purpose, a specific agenda, and defined desired outcomes, the meeting shall not take place.

Meeting Rule #2 Carefully select attendees and share the purpose and agenda of your meeting with the attendees in advance, asking for feedback. Continue to foster debate and two-way interactions in the meeting.

Meeting Rule #3 Adapt architecture of meetings to the purpose, agenda and size of the meeting, by carefully selecting visual aids, meeting locations, duration and formality to fit your needs.

Meeting Rule #4 Stay close to the agenda to show that you value results, and at the same time give praise where praise is due both during the meeting and in the minutes. Make sure you make it very clear when you and your team have reached a desired outcome in your meeting.

Meeting Rule #5 Never call a meeting to drive outcomes you are not comfortable with from an ethical standpoint.

I’m very interested to hear what you think about these rules, and if you have other heuristics for making meetings work. I’m sure not all meetings I lead are great, but they are probably much better now that I’ve realized the few things summarized in these rules than they used to be. Tell me what you think in the comment field, or on Twitter (@sjefersuper).

Avoiding paralysis by analysis 

When faced with a difficult decision, it is easy to get stuck wanting ever more information. How do you avoid analyzing everything you do not know, or maybe just as important, how do you avoid paying a consultant for doing stuff you don’t need?


Start by taking a step back. You should check your priorities and needs for information as a first step. What do you need to know, and what would you like to know? If you are aiming for the latter you may be in danger of paralysis. When your view on the matter has become clear you can start the path to the decision by asking questions:

  • What do I already know?
  • What do I need to fill in?
  • Do I have the right competence available?
  • What is the most efficient way to get what I need?

By asking and answering these questions, you are in a much better position, even when you need to use consultants, because you know what to ask for.

Your safety and human factors

Today I almost lost a wheel on my car while driving. How could this happen? This morning I went to the dealership to change tires on my car. I had to wait for the mechanic to come in to work in the morning because I was a bit early, and I didn’t really have an appointment, I just showed up. After a while the mechanic arrived and after some 15 minutes they gave me my keys back and told me “you’re good to go”. Happily I drove down to the office with new tires on the car.


A little later today I was going to see one of our clients, and I started driving. After a few kilometers I started to hear a banging noise from the back of my car. This had happened to me once before, several years back, so I immediately understood that one of the wheels had not been properly fastened and that I should stop right away. I pulled over and walked around the car – and yes, the bolts were loose – actually they were almost falling out by themselves! Obviously this was very dangerous, and I called the dealership and told them to come and finish the job on the roadside. They actually showed up after just a few minutes, and the guy coming over apologized and tightened all the bolts. I suggested to him that they should consider having a second person check the work of the mechanic before handing the vehicle back to the owner. He agreed – and told me it was “the new guy” who’d changed the tires in the morning.

This was down to several factors – a new guy, maybe with insufficient training, and a client putting pressure on him to finish the job (I was inside the dealership drinking coffee and didn’t talk to the mechanic, but he may have interpreted it this way). Simple things affect our performance and can have grave consequences. Why did it not go wrong this time? Because I recognized the sound, I stopped in time before the wheel fell off – it was just luck in addition to experience with the same kind of problem. Humans can thus be both the initiating causes of accidents and barriers against accidents.

Favourite boots for walking

My good colleague Ida has been reading my blog, and she told me it was great, but… it is lacking an important type of post that she was used to reading on other great blogs. So, in honor of Ida, I bring to you, a one-time only happening: the outfit of the day. I really promise never to do this again 😉

Here, the safety consultant is wearing McKinley hiking boots, pants from Nordheim, and a soft-shell from Craft that is a trusted ally on short hiking trips. Nothing is like starting a rainy summer day in the forest instead of the office!

Four golden practices for functional safety management

Managing functional safety activities and ensuring high integrity of instrumented barriers is not fundamentally different from other project management activities. This means that functional safety management should be integrated into the overall project planning, management and controlling activities. I will be presenting a paper on this topic, written in cooperation with several colleagues at Lloyd’s Register Consulting, at the next ESREL conference, but here is a sneak peek at the four golden practices.

Golden practice 1 – Planning of functional safety should be a group activity involving all relevant organizations

Management of functional safety should be planned for the asset as a system, taking the whole lifecycle into account. Normally, the scope is split between a number of organizations and persons (owner, engineering, vendors, consultants, etc.). In order to plan activities and responsibilities such that they can be integrated into all these different organizations’ activities, a common planning session at the outset of a project is a good practice to coordinate activities and align priorities. Such a meeting should be facilitated by a competent functional safety expert. The results of functional safety planning should then be integrated into each organization’s project plan.

Golden practice 2 – Competence mapping and training development

Each company involved in the safety lifecycle shall have competence requirements for each role related to the work to be done. Mapping of competence of the employees should be performed in order to identify gaps, and training plans developed to make sure such gaps are closed. In assessing competence requirements, the factors described in Chapter 5 in the Norwegian Oil and Gas Association’s Guideline 070 should be used as a basis.

Golden practice 3 – Functional safety requirements in contracts

Include functional safety requirements in contracts across all interfaces, with clear descriptions of the expected level of involvement, as well as deliverables such as hardware, software and the associated documentation in accordance with project requirements. The contract should state that all parties are required to prepare for and participate in audits and functional safety assessments as needed by the project. A simple reference to a standard may be legally binding, but on its own it leaves unclear exactly what the priorities are and which activities each organization shall take care of.

Golden practice 4 – Constructive auditing

Consider the need for audits of partners and vendors based on project risk (non-conformance risks, schedule risks and cost impact of such slips). If vendors have responsibility for development and engineering activities, auditing of these vendors should be considered. Functional safety audits should be integrated into the project’s overall plan.

Implementing the golden practices does not ensure a problem-free project, but chances of high performance will certainly be improved by adopting these practices in your next project. Golden Practice 1 – looking at functional safety planning as a cross-organizational activity – is especially beneficial for establishing a common understanding and common goals for everyone involved.

The good and bad sides of proof testing

Testing is an integral part of operation and maintenance of equipment with a SIL rating. Testing is necessary to ensure the achieved integrity of the safety instrumented function is actually as intended. First of all, the mere assumption of a test interval (hours between each proof test, τ) has a direct impact on the calculated probability of failure on demand (PFD) for a given failure rate for dangerous undetected failures, λDU:

PFD = λDU × (τ/2)

This function is valid for calculating the average PFD for a single component. For redundant configurations things become more complicated but let us stick to this one for the sake of simplicity. Obviously, if we cut the number of hours between tests in half, we cut the PFD in half. So, the more often we test, the better it is – right? No – Wrong!
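To make the halving effect concrete, here is a minimal Python sketch of the single-component formula above. The failure rate used is a made-up illustrative number, not data for any real component, and the function name is just a label chosen for this example.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def average_pfd(lambda_du: float, tau_hours: float) -> float:
    """Average PFD for a single component: lambda_DU * tau / 2.

    lambda_du  -- dangerous undetected failure rate, per hour
    tau_hours  -- hours between proof tests
    """
    if lambda_du < 0 or tau_hours <= 0:
        raise ValueError("lambda_du must be >= 0 and tau_hours must be > 0")
    return lambda_du * tau_hours / 2


# Hypothetical failure rate, chosen only for illustration.
example_lambda_du = 2.0e-6  # failures per hour

for label, tau in (("annual test", HOURS_PER_YEAR),
                   ("test every six months", HOURS_PER_YEAR / 2)):
    print(f"{label}: PFD = {average_pfd(example_lambda_du, tau):.2e}")
```

Halving τ halves the result, exactly as the formula says. The sketch deliberately leaves out redundancy, test coverage and the downsides of testing discussed below.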

Why is that wrong? Testing has two negative sides:

  1. It stops production, which means it stops cash flow, which means it costs money there and then
  2. It is a source of errors itself, either through increased wear on the system or more likely, a possibility of human error like forgetting to put a system back into automatic mode after testing is done

Of course, this does not mean that we should not test – that is absolutely necessary to make sure the safety function works. Also, over time we can use the results from testing of our functions in the SIS to check whether the assumed failure rates are correct. What it means is that we need to find the right balance between the good side and bad side of testing. In practice, annual testing is often used, and this may well be a sweet spot for test frequencies. Sometimes engineers are tempted to increase the test frequency to avoid trouble with PFD numbers after they have bought inferior equipment. People working on the installations tend to strongly oppose this – and rightly so. Buy good components, and test with a reasonable frequency to minimize the impact of the bad things about testing.

Operating systems and safety critical software – does the OS matter?

Safety critical software must be developed in accordance with certain practices, using specific tools fit for the required level of reliability, as well as by an organization with the right competence and maturity for developing such software. These software components are often parts of barrier management systems; bugs may lead to removal of critical functionality that again can lead to an accident situation with physical consequences, such as death, injuries, release of pollutants to the environment and severe material damage.

It is therefore a key question whether such software should be allowed to run under consumer grade operating systems that do not conform to any reliability development practices. The primary argument from some vendors for allowing this is “proven in use”: they have used the software under said operating system for so many operating hours without incident that they feel experience has shown the system to be safe.

It immediately seems reasonable to put more trust in a system setup that has been field tested over time and shown the expected performance than in a system that has not been tested in the field. Most operating systems are issued with known bugs in addition to unknown bugs, and a list of bugs will exist for patching. A prioritization of criticality is made, and the patches are developed accordingly. For Linux systems this patching strategy may be somewhat less organized as development is more distributed and less managed, even at the kernel level. The problem is akin to the classical software security problem: if software with design flaws and bugs is released, such flaws are only found when externals discover a vulnerability, or when an incident occurs that reveals the flaw. The bug or flaw is always inherent in the code, and typically stems from a lack of good practices during design and code development. In theory, damage resulting from such bugs and flaws shall then be limited by patching the system. In the meantime it is thought that perimeter defences can counteract the risk of a vulnerability being exploited (this argument may not even hold in the security situation). For bugs affecting the safety of the underlying system, this thinking is flawed because even a single accident may have unacceptable consequences – including loss of human life.

In reliability engineering it is disputed whether a “workflow oriented” or “reliability growth oriented” view on software development and reliability is the most fruitful. Both have their merits. The “ship and patch” thinking inherent in proven in use cases for software indicates a stronger belief in reliability growth concepts. These are models that try to link the number of faults per operating time of the software to the duration of the discovery period; most of them are modeled as some form of a Poisson process. It is acknowledged that this is a probabilistic model of deterministic errors inherent in the software, and the stochastic element is whether the actual software states realizing the errors are visited during software execution.
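As an illustration of what such a reliability growth model can look like, the sketch below uses the Goel-Okumoto model, one commonly cited non-homogeneous Poisson process model. It is picked here only as an example; the text above does not single out any particular model, and the parameter values are hypothetical.

```python
import math


def expected_faults_found(a: float, b: float, t: float) -> float:
    """Goel-Okumoto mean value function m(t) = a * (1 - exp(-b * t)).

    a -- expected total number of faults in the software (assumed)
    b -- fault detection rate per unit of operating time (assumed)
    t -- accumulated operating time
    """
    return a * (1.0 - math.exp(-b * t))


def failure_intensity(a: float, b: float, t: float) -> float:
    """Expected faults discovered per unit time at time t: a * b * exp(-b * t)."""
    return a * b * math.exp(-b * t)


# Hypothetical parameters, for illustration only.
total_faults = 100.0
detection_rate = 0.001  # per operating hour

for hours in (0, 1000, 5000, 20000):
    found = expected_faults_found(total_faults, detection_rate, hours)
    remaining = total_faults - found
    print(f"t = {hours:>6} h: found ~ {found:.1f}, remaining ~ {remaining:.1f}")
```

The point of the sketch is the shape of the curve: under this model the discovery rate decays over time, but the expected number of remaining faults only approaches zero asymptotically. It also assumes the code base is frozen, which is exactly what a rapidly growing operating system is not.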

Coming back to operating systems, we see that the complexity of such systems has grown very rapidly over the last decades. Looking specifically at the Linux kernel, the development has been tracked over time. The first kernel had about 10,000 lines of code (in 1991). For the development of kernel version 2.6.29 they added almost 10,000 lines of code per day. If a reliability growth concept is going to work for such a rapid growth in complexity, 10,000 lines of code must be analyzed daily and end up as completely bug-free – and to prove that, it is necessary to test every software state for those 10,000 lines of code.

Some research exists to compare effect of coding practices. Microsoft stated in 1992 that they had about 10-20 errors per 1000 lines of code prior to testing and that about 0.5 errors per 1000 lines of code would exist in shipped products.

Compliant development gives no guarantee of flaw-free and bug-free software. The same goes for development following good security practices – vulnerabilities may still exist. These practices have, however, been developed to minimize the number of design flaws and bugs getting into the shipped product. Structured programming techniques have been shown to produce code with fewer than 0.1 defects per 1000 lines of code – basically by following a workflow oriented quality regime in tandem with testing. If we assume 0.5 errors per 1000 lines of code in the Linux kernel (the kernel is not the entire OS), and roughly 15 million lines of code, we have an estimated 7500 undiscovered bugs in the shipped version of Linux kernel 3.2.
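The arithmetic behind this estimate is simple enough to sketch in a few lines of Python. The line count of roughly 15 million is an assumption here: it is the figure implied by 7500 bugs at 0.5 errors per 1000 lines, not a measured value from this post.

```python
def estimated_residual_defects(lines_of_code: int, defects_per_kloc: float) -> float:
    """Rough estimate: defect density (per 1000 lines) times code size."""
    return lines_of_code / 1000 * defects_per_kloc


# Assumed code size, back-calculated from the figures in the text.
ASSUMED_KERNEL_LINES = 15_000_000

for label, density in (("shipped-product density (0.5 per 1000 lines)", 0.5),
                       ("structured programming (0.1 per 1000 lines)", 0.1)):
    bugs = estimated_residual_defects(ASSUMED_KERNEL_LINES, density)
    print(f"{label}: about {bugs:.0f} undiscovered bugs")
```

Even at the lower density quoted for structured programming techniques, a code base of this size would still be expected to ship with a four-digit number of undiscovered defects.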

An international rating for security of operating systems exists: the EAL (Evaluation Assurance Level) rating from the Common Criteria. Commercial grade systems have a rating of EAL 4, whereas secure RTOSs tend to be EAL 5 (semiformally designed, verified and tested).

The summary seems to be that using consumer grade OSs for life critical automation systems is not the best of ideas – which is why we don’t see too many of them.

Obtaining the necessary documentation from vendors – why is it so hard?

When buying technical items with certain requirements – such as reliability requirements / SIL requirements – obtaining documentation showing that the requirements are all met can be a headache. This is especially true for SIL projects, in particular if the vendor is not used to providing the correct documentation. When this is the case – what can be done to make sure we do not get delays and problems with non-conformity? The answer is as logical and straightforward to state as it is difficult to implement. Experience has shown that some practices may make obtaining compliance documentation easier – and it is all about communication.

First of all – vendors must be made aware of the requirements at the time they are bidding for the sale – and not only through a reference to a standard, but with an actual explanation of what it means and what is expected. The party selling something should really try to understand this by asking the right questions, but they don’t always do that. So, communicating with the vendor from early on is important. This can be done by including requirements in the purchase order or contract.

This is, of course, not enough. So, vendor follow-up should be part of the planning of the engineering activities – just like you would plan to spend resources on requirement setting or participating in FATs!

When this has been planned – it is time to step up and help the vendor. Provide a guideline describing what documentation they need to deliver. Provide training to make them understand what is being asked of them, if they do not have a sufficient level of understanding. Make sure the engineer responsible for following up the delivery is also in the know about these requirements. Then the two contacts can speak the same language.

The vendor needs to know what the actual requirements are. It is therefore good practice to develop and supply the safety requirement specification as early as possible. Of course – that may lead to later changes as more information becomes available, but as long as everyone is on board with that and changes are properly managed, including across interfaces, this is a much better situation than requirements coming to vendors too late.

When the vendor has access to the SRS, the follow-up process should be intensified. Ask regular questions about progress – this may be done in an informal manner if the business climate is right. Open up for questions and be a support to the vendor in this process. Keep the conversation going. And make sure you get the progress you need. Tools for “expediting” progress should first and foremost be on the carrot side – let the vendor see real business benefit and value from providing good documentation and fulfilling expectations. Carrots can be things such as a better chance of repeat business, improved vendor competence and standing in the market, and possibly the option to become a shortlisted or preferred supplier. In addition to a bag of carrots, it may be necessary to carry a stick when soft talking no longer works. Metaphorical sticks may be things like payments tied to documentation deliverables, banning the supplier from future projects and general reputation damage. Hopefully you can keep the sticks hidden in the golf bag.

This provides no guarantee – but thinking about this stuff is infinitely better than not doing anything. My experience is that a little help brings a lot of progress.

To package these thoughts into a more condensed form you may refer to the following infographic – which sums up what to focus on during planning, design and follow-up.

Infographic showing how to follow up with vendors to obtain the necessary documentation on SIL compliance.