Teaching smart things cyber self-defense: ships and cars that fight back

We have connected everything to the Internet – from power plants to washing machines, from watches with fitness trackers to self-driving cars and even self-driving gigantic ships. At the same time, we struggle to defend our IT systems from criminals and spies. Every day we read about data breaches and cyber attacks. Why are we then not more afraid of cyber attacks on the physical things we have put into our networks?

  • Autonomous cars getting hacked – what if they crash into someone?
  • Autonomous ships getting hacked – what if they lose stability and sink due to a cyber attack or run over a small fishing boat?
  • Autonomous light rail systems – what if they are derailed at high speed due to a cyber attack?

Luckily, we are not reading about things like this in the news, at least not very often. There have been some reports of car hacking, usually demonstrations of what is possible. But when we build more and more of these smart systems that can cause disasters if control is lost, shouldn’t we think more about security when we build and operate them? Perhaps you think that someone must surely be taking care of that. But the fact is that in many cases it isn’t handled very well.

How can an autonomous vessel defend against cyber attacks?

What is the attack surface of an autonomous system?

The attack surface of an autonomous system may of course vary, but such systems tend to have some things in common:

  • They have sensors and actuators communicating with a “brain” to make decisions about the environment they operate in
  • They have some form of remote access and support from a (mostly) human operated control center
  • They require systems, software and parts from a high number of vendors with varying degree of security maturity

For the sake of the example, consider an autonomous battery-powered vessel at sea, such as a ferry. Such a vessel will have multiple operating modes:

  • Docking to the quay
  • Undocking from the quay
  • Loading and offloading at quay
  • Journey at sea
  • Autonomous safety maneuvers (collision avoidance)
  • Autonomous support systems (bilge, ballast, etc)

In addition, there will typically be a number of operations that are to some degree human-led, such as search and rescue in a man-overboard situation, firefighting, and other operations, depending on the operating concept.

To support the operations required in the different modes, the vessel will need an autonomous bridge system, an engine room able to operate without a human engineer in place to maintain propulsion, and various support systems for charging, mooring, cargo handling, etc. This will require a number of IT components in place:

  • Redundant connectivity with sufficient bandwidth (5G, satellite)
  • Local networking
  • Servers to run the required software for the ship to operate
  • Sensors to ensure the ship’s autonomous system (and the human operators in the onshore support center) has good situational awareness

The attack surface is likely to be quite large, including a number of suppliers, remote access systems, people and systems in the remote control center, and remote software services that may run in private data centers, or in the cloud. The main keyword here is: complexity.

Defending against cyber piracy on the high seas

With normal operation of the vessel, its propulsion and bridge systems would not depend on external connectivity. Although cyber attacks can also hit conventional vessels, much of the damage can be counteracted by seafarers onboard taking more manual control of the systems and turning off much of the “smartness”. With autonomous systems this is not always an option, although there are degrees of autonomy and it is possible to use similar mechanisms if the systems are semi-autonomous with people present to take over in case of unusual situations. Let’s assume the systems are fully autonomous and there is nobody onboard to take control of them.

Since there are no people to compensate for digital systems working against us, we need to teach the digital systems to defend themselves. We can apply the same structural approach to securing autonomous systems, as we do to other IT and OT systems; but we cannot rely on risk reduction from human local intervention. If we follow “NSM’s Grunnprinsipper for IKT-sikkerhet” (the Norwegian government’s recommendations for IT security, very similar to NIST’s cybersecurity framework), we have the following key phases:

  1. Identify: know what you have and the security posture of your system
  2. Protect: harden your systems and use security technologies to stop attackers
  3. Detect: set up systems so that cyber attacks can be detected
  4. Respond: respond to contain compromised systems, evict intruders, recover capabilities, improve hardening and return to normal operations

These systems are also operational technologies (OT). It may therefore be useful to refer to IEC 62443 when analysing the systems, especially to assess the risk to the system, assign required security levels and define requirements. The IEC 62443 reference architecture is also useful.

Not every security mechanism has to work completely autonomously on an autonomous system, but security must be more automated than in a normal OT system, and also more than in most IT systems. Let’s consider what that could mean for a collision avoidance system on an autonomous vessel. The job of the collision avoidance system can be defined as follows (see the sketch after the list):

  1. Detect other vessels and other objects that we are on a collision course with
  2. Detect other objects close-by
  3. Choose action to take (turn, stop, reverse, alert on-shore control center, communicate to other vessels over radio, etc)
  4. Execute action
  5. Evaluate effect and make corrections if needed
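
A minimal sketch of one iteration of this loop could look like the following. All class names, thresholds and the decision rule are hypothetical, for illustration only; a real system would use proper CPA/TCPA computations and COLREG-compliant maneuver planning.

    from dataclasses import dataclass
    from enum import Enum, auto

    class Action(Enum):
        STOP = auto()
        TURN = auto()
        RADIO_NEARBY_VESSELS = auto()
        ALERT_CONTROL_CENTER = auto()

    @dataclass
    class Track:
        """A detected object with estimated closest point of approach (CPA)."""
        object_id: str
        cpa_meters: float       # closest point of approach
        tcpa_seconds: float     # time until CPA

    def collision_avoidance_step(tracks: list[Track],
                                 cpa_limit: float = 200.0,
                                 tcpa_limit: float = 120.0) -> list[Action]:
        """One iteration of steps 1-4: detect threats, choose and return actions."""
        threats = [t for t in tracks
                   if t.cpa_meters < cpa_limit and t.tcpa_seconds < tcpa_limit]
        if not threats:
            return []
        # Conservative default: slow down, warn others, notify onshore control.
        return [Action.STOP, Action.RADIO_NEARBY_VESSELS, Action.ALERT_CONTROL_CENTER]

Step 5 is the outer loop: re-run this function on fresh sensor data and correct the chosen actions as the situation develops.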

In order to do this, the ship has a number of sensors to provide the necessary situational awareness. There has been a lot of research into such systems, especially collaborative systems with information exchange between vessels. There have also been pilot developments, such as this one by the Norwegian firm Maritime Robotics: https://www.maritimerobotics.com/news/seasight-situational-awareness-and-collision-avoidance

Let us consider a simplified view of how the collision avoidance system works. Sensors tell the anti-collision server what they see. The traffic is transmitted over proprietary protocols, some over TCP, some over UDP (camera feeds). Some of the traffic is not encrypted, but all of it stays on the local network. The main system server processes the data onboard the ship and makes decisions. Those decisions go to functions in the autonomous bridge that take action, including sending radio messages to nearby ships or onshore. Data is also transmitted to onshore control via the bridge system. Onshore operators can use a remote connection to connect directly to the collision avoidance server for richer data, as well as to override or configure the system.
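
To make the concern concrete, here is a toy sketch of what ingesting such unauthenticated sensor traffic might look like. The wire format and port are invented for illustration; the point is that without encryption or message authentication, anything on the network segment could spoof these packets, which is exactly why segmentation and sensor write-protection matter.

    import socket
    import struct

    # Hypothetical wire format for one radar track, little-endian:
    # (track_id: uint32, bearing_deg: float32, range_m: float32)
    TRACK_FORMAT = "<Iff"

    def listen_for_tracks(port: int = 4950) -> None:
        """Receive raw sensor datagrams from the local network."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("0.0.0.0", port))
        while True:
            data, addr = sock.recvfrom(1024)
            if len(data) != struct.calcsize(TRACK_FORMAT):
                continue  # malformed datagram, drop it
            track_id, bearing, range_m = struct.unpack(TRACK_FORMAT, data)
            # Nothing here proves the packet came from a real sensor.
            print(f"track {track_id} from {addr}: {bearing:.1f} deg, {range_m:.0f} m")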

Identify

The system should automatically create a complete inventory of its hardware, software, networks, and users. This inventory must be available for automated security decision-making, but also for human and AI agents working as security operators onshore.

The system should also automatically keep track of all temporary exceptions and changes, as well as any known vulnerabilities in the system.

In other words: a real-time security posture management system must be put in place.
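
As a sketch of what one record in such a posture management system might hold, consider the following. The field names are assumptions made for illustration, not a reference schema.

    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class Asset:
        """One entry in the real-time inventory."""
        hostname: str
        role: str                     # e.g. "collision-avoidance-server"
        software: dict[str, str]      # package name -> installed version
        open_ports: list[int]
        known_cves: list[str] = field(default_factory=list)
        exceptions: list[str] = field(default_factory=list)  # temporary deviations
        last_seen: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def needs_attention(assets: list[Asset]) -> list[Asset]:
        """Assets with known vulnerabilities or open exceptions drive both
        automated decisions and the onshore operators' overview."""
        return [a for a in assets if a.known_cves or a.exceptions]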

Protect

An attacker may wish to perform different types of actions on this vessel. Since we are only looking at the collision avoidance system here we only consider an adversary that wants to cause an accident. Using a kill-chain approach to our analysis, the threat actor thus has the following tasks to complete:

  • Recon: get an overview of the attack surface
  • Weaponization: create or obtain payloads suitable for the target system
  • Delivery: deliver the payloads to the systems. Here the adversary may find weaknesses in remote access, perform a supply-chain attack to deliver a flawed update, use an insider to gain access, or compromise an on-shore operator with remote access privileges.
  • Execution: for a purely technical attack, automated execution will be necessary. For human-based attacks, operator-executed commands will likely be the way malware is run.
  • Installation: establish persistence, for example through valid accounts on systems or malware running on a Windows server
  • Command and control: use internet connection to remotely control the system using installed malware
  • Actions on objectives: reconfigure sensors or collision avoidance system by changing parameters, uploading changed software versions, or turning the system off

If we want to protect against this, we should harden our systems as much as possible (a toy example of one such rule follows the list).

  • All access should require MFA
  • Segregate networks as much as possible
  • Use least privilege as far as possible (run software as non-privileged users)
  • Write-protect all sensors
  • Run up-to-date security technologies that block known malware (firewalls, antivirus, etc)
  • Run only pre-approved and signed code, block everything else
  • Remove all unused software from all systems, and disable built-in functionality that is not needed
  • Block all non-approved protocols and links on the firewall
  • Block internet egress from endpoints, and only make exceptions for what is needed

Doing this will make it very hard to compromise the system using regular malware, unless operations are run as an administrator who can change the hardening rules. It will most likely also protect against malware run as an administrator, if the threat actor has not anticipated the hardening steps. Blocking traffic both on the main firewall and on host-based firewalls makes it unlikely that the threat actor will be able to remove both security controls.

Detect

If an attacker manages to break into the anti-collision system on our vessel, we need to be able to detect this quickly, and to respond to it. The autonomous system should ideally perform the detection on its own, without waiting for a human analyst, because of the need for fast response. Using humans (or AI agents) onshore in addition is also a good idea (a toy behavioral-baseline sketch follows the list). As a minimum the system should:

  • Log all access requests and authorization requests
  • Apply UEBA (user and entity behavior analytics) to detect unusual activity
  • Use advanced detection technologies, such as the security features of an NGFW, a SIEM with robust detection rules, and thorough audit logging on all network equipment and endpoints
  • Use EDR technology to provide improved endpoint visibility
  • Receive and use threat intelligence in relevant technologies
  • Use deep packet inspection systems with protocol interpreters for any OT systems part of the anti-collision system
  • Map threat models to detection coverage to ensure anticipated attacks are detectable
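
To show the flavor of the UEBA point in miniature, here is a deliberately naive behavioral baseline: learn which hours each account normally authenticates in, and flag everything else. Real products model far more features; all names below are illustrative.

    from collections import defaultdict

    class LoginBaseline:
        """Toy baseline of which hours of the day each account logs in."""

        def __init__(self) -> None:
            self.hours_seen: dict[str, set[int]] = defaultdict(set)

        def learn(self, user: str, hour: int) -> None:
            self.hours_seen[user].add(hour)

        def is_anomalous(self, user: str, hour: int) -> bool:
            # Unknown account, or a never-before-seen login hour -> alert.
            return hour not in self.hours_seen.get(user, set())

    baseline = LoginBaseline()
    for hour in range(6, 18):                    # operator normally works daytime
        baseline.learn("onshore-operator", hour)
    assert baseline.is_anomalous("onshore-operator", 3)   # a 03:00 login is flagged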

By using a comprehensive detection approach to cyber events, combined with a well-hardened system, it will be very difficult for a threat actor to take control of the system unnoticed.

Respond and recover

If an attack is detected, it should be dealt with before it can cause any damage. For a safety-critical system, it may be a good idea to conservatively plan a physical response for an autonomous ship when a cybersecurity intrusion is detected, even if the detection is not 100% reliable. A possible response could be (see the sketch after the list):

  • Isolate the collision avoidance system from the local network automatically
  • Stop the vessel and maintain position (using dynamic positioning if available and not itself subject to security detections, with dropping anchor as a backup)
  • Alert nearby ships over radio that “Autonomous ship has lost anti-collision system availability and is maintaining position. Please keep distance.”
  • Alert onshore control of the situation.
  • Run system recovery
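
A minimal orchestration of this sequence could look like the sketch below. Every function is a placeholder for a ship-specific integration; the point is only that the steps run automatically and in a fixed, conservative order.

    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("incident-response")

    # Placeholder integrations - real implementations talk to ship systems.
    def isolate_collision_avoidance_network() -> None:
        log.info("collision avoidance system isolated from local network")

    def stop_vessel_and_hold_position() -> None:
        log.info("vessel stopped, holding position (DP or anchor)")

    def broadcast_radio_warning(message: str) -> None:
        log.info("radio broadcast: %s", message)

    def alert_onshore_control(message: str) -> None:
        log.info("onshore alert: %s", message)

    def respond_to_intrusion() -> None:
        """Run the conservative response sequence above, in order."""
        isolate_collision_avoidance_network()
        stop_vessel_and_hold_position()
        broadcast_radio_warning("Autonomous ship has lost anti-collision system "
                                "availability and is maintaining position. "
                                "Please keep distance.")
        alert_onshore_control("Intrusion detected; automated response engaged.")
        # System recovery (forensics, reinstall, validation) follows from here.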

System recovery could entail: securing forensic data; automatically analysing the data for indicators of compromise to identify patient zero and the exploitation path; expanding the blast radius to continue the analysis through pivots; reinstalling all affected systems from trusted backups; updating configurations and hardening against the exploitation path where possible; performing system validation; and transferring back to operations with approval from onshore operations. Establishing a response system like this would require considerable engineering effort.

An alternative approach is to maintain position, and wait for humans to manually recover the system and approve returning to normal operation.

The development of autonomous ships, cars and other high-risk applications is subject to regulatory approval. Yet, the focus of authorities may not be on cybersecurity, and the competence of those designing the systems, as well as of those approving them, may be stronger in other areas than cyber. This is especially true for sectors where cybersecurity has not traditionally been a big priority due to more manual operations.

A cyber risk recipe for people developing autonomous cyber-physical systems

If we are going to make a recipe for development of responsible autonomous systems, we can summarize it in five main steps:

  • Maintain good cyber situational awareness. Know what you have in your systems, how it works, and where you are vulnerable – and also keep track of the adversary’s intentions and capabilities. Use this to plan your system designs and operations. Adapt as the situation changes.
  • Rely on good practice. Use IEC 62443 and other known IT/OT security practices to guide both design and operation.
  • Involve the suppliers and collaborate on defending the systems, from design to operations. We only win through joint efforts.
  • Test continuously. Test your assumptions, your systems, your attack surface. Update defenses and capabilities accordingly.
  • Consider changing operating mode based on threat level. With good situational awareness you can take precautions when the threat level is high by reducing connectivity to a minimum, moving to lower degree of autonomy, etc. Plan for high-threat situations, and you will be better equipped to meet challenging times.

What does the IEC 61508 requirement to have a safety management system mean for vendors?

All companies involved in the safety lifecycle are required to have a safety management system, according to IEC 61508. What the safety management process entails for a specific project is relatively clear from the standard, and is typically described in an overall functional safety management plan. It is, however, much less clear from the standard what is expected of a vendor producing a component that is used in a safety instrumented system (SIS), but that is a generic product rather than a system designed specifically for one particular situation.

For vendors, the safety management system should be extensive enough to support fulfillment of all four aspects of the SIL requirement the component is targeting:

  • Quantitative requirements (PFD/PFH)
  • Semi-quantitative and architectural requirements (HWFT, SFF, etc.)
  • Software requirements
  • Qualitative requirements (quality system, avoidance of systematic failures)

A great safety management system is tailored to maintain the safety integrity level capability of the product from all four perspectives. Maintaining this integrity requires a high-reliability organization, as well as knowledgeable individuals.

Quite often, system integrators and system owners experience challenges working with vendors. We’ve discussed this in previous posts, e.g. follow-up of vendors. Based on experience from several sides of the table, the following parts of a safety management system are found to be essential:

  • A good system for receiving feedback and using experience data to improve the product
  • Clear role descriptions, competence requirements and a training system to make sure all employees are qualified for their roles
  • A good change management system, ensuring impact of changes is looked at from several angles
  • A quality system that ensures continuous improvement can occur, and that such processes are documented
  • A documentation system that ensures the capabilities of the product can be documented in a trusted way, taking all changes into account in a transparent manner

A vendor that has such systems in place will have a much greater chance of delivering top-quality products than a vendor that only focuses on the technology itself. Ultra-reliable products require great organizations to stay ultra-reliable throughout the entire lifecycle.

Would you operate a hydraulic machine with no cover or emergency stop?

Sounds like a crazy idea, right? Research, however, has shown that about half of professional machine operators do not think safety functions are necessary. You know, things like panel limit switches, torque limiters and emergency stop buttons. Who would need that, right? This number comes from a report issued in Germany about 10 years ago, and I am not very optimistic about improvements in these numbers since then. The report can be found here: http://www.dguv.de/ifa/Publikationen/Reports-Download/BGIA-Reports-2005-bis-2006/Report-Manipulation-von-Schutzeinrichtungen/index.jsp (in German).

Machine safety is important for both users and bystanders. Manipulations of safety functions are common – and the risk increase is typically unknown to users and others. How can we avoid putting our people at risk due to degraded safety of our machinery?

Researchers have found that the safety functions of machines are frequently manipulated. This is typically done because workers perceive the manipulation as necessary to perform work, or to improve productivity. Everyone from machine builders to purchasers to operators should take this into account, to avoid accidents. Consider for example a limit switch. A machine built to conform to the machinery directive (with CE marking) has to satisfy safety standards. Perhaps a SIL 2 requirement has been assigned to the limit switch because operation without it is deemed dangerous and a 100-fold risk reduction is necessary for operation to be acceptable. This means that if the limit switch is put out of function, the risk of operation is 100 times higher than the designer intended!
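
A quick back-of-the-envelope check of that claim, using the low-demand SIL bands from IEC 61508 (risk reduction factor = 1/PFDavg):

    # SIL 2 (low demand) allows an average probability of failure on demand
    # (PFDavg) between 1e-3 and 1e-2; the risk reduction factor is 1/PFDavg.
    sil2_pfd_upper = 1e-2
    minimum_rrf = 1 / sil2_pfd_upper
    print(f"Minimum risk reduction for SIL 2: {minimum_rrf:.0f}x")   # 100x
    # Bypassing the limit switch removes this factor entirely, so the
    # operating risk is at least 100 times the design intent.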

What can we do about this? We need to design machines such that safety functions become part of the workflow – not an obstacle to it. If workers perceive nothing to gain from manipulating the machine, they are not likely to do it. This boils down to something we are aware of, but are not good enough at taking into account in design processes: usability testing is essential not only to make sure operators are happy with the ergonomics – it is also essential for the safety of the people using the machine!

Why functional safety audits are useful

Functional safety work usually involves a lot of people, and multiple organizations. One key success factor for design and operation of safety instrumented systems is the competence of the people involved in the safety lifecycle. In practice, when activities have been omitted, or the quality of the work is not acceptable, this is discovered in the functional safety assessment towards the end of the project, or worse, it is not discovered at all. The result of too low quality is lower integrity of the SIS as a barrier, and thus higher risk to people, assets and environment – without the asset owner being aware of this! Obviously a bad situation.

Knowing how to do your job is important in all phases of the safety lifecycle. Functional safety audits can be an important tool for verification – and for motivating organizations to maintain good competence management systems for all relevant roles.

Competence management is a key part of functional safety management. In spite of this, many companies have less than desirable track records in this field. This may be due to ignorance, or maybe because some organizations view a «SIL» as a marketing designation rather than a real risk reduction measure. Either way – such a situation is unacceptable. One key tool for ensuring everybody involved understands what their responsibilities are, and makes an effort to learn what they need to know to actually secure the necessary system level integrity, is the use of functional safety audits. An auditing program should be available in all functional safety projects, with at least the following aspects:

  • A procedure for functional safety audits should exist
  • An internal auditing program should exist within each company involved in the safety lifecycle
  • Vendor auditing should be used to make sure suppliers are complying with functional safety requirements
  • All auditing programs should include aspects related to document control, management of change and competence management

Constructive auditing can be an invaluable part of building a positive organizational culture – one where quality becomes important to every function involved in the value chain, from the sales rep to the R&D engineer.

One day statements like “please take the chapter on competence out of the management plan, we don’t want any difficult questions about systems we do not have” may seem like an impossible absurdity.

Stages of process safety understanding

Defining process safety should be quite straightforward. However, what people mean by this term can vary quite a lot, and what to include in it depends a lot on the understanding people have of the anatomy of severe accidents. Personally, I have met the following different understandings of the topic:

  • Process safety is what is governed by API 521 (basically steel strength and dimensioning of pressure relief valves)
  • Process safety is the technical measures taken to stop an accident from occurring
  • Process safety is the sum of organizational and technical systems involved in mitigating risk of major accidents

The first statement is obviously too narrow – especially as we know that more than half of accidents are down to human factors! Definition number 2 is a traditional view, and slightly more mature as it includes both the safety instrumented system and alarm management (to a certain extent). The last definition is maybe the most “modern”, and includes organizational culture, safety leadership as well as the technologies included in the first and second definitions.

How people understand the term “process safety” tends to mature over time – from a strictly technical view to a more holistic view including both individual and organizational factors, as well as the technologies and how they are used in a system. The walk up this staircase from the technology-focused to the holistic view can take a long time, but conscious reflection can help speed the path to improved performance and risk management.

A complete understanding of barrier systems, which is really what risk management is about, requires an understanding of which factors are influencing accident risk, and what can be done to mitigate the risk. This requires that the asset owner thinks not only about “proof testing”, “compliance” or “asset management”, but also about:

  • Leadership
  • Barrier integrity
  • Maintenance
  • Monitoring
  • Design
  • Competence management
  • Permit to work system
  • Dynamics of plant and controls in normal and degraded modes
  • Etc, etc, etc.

In other words – to keep risk under control you need to take the full complexity of your operations into account. A purely technical view on process safety is thus simply not good enough.

Your safety and human factors

Today I almost lost a wheel on my car while driving. How could this happen? This morning I went to the dealership to change tires on my car. I had to wait for the mechanic to come in to work in the morning because I was a bit early, and I didn’t really have an appointment, I just showed up. After a while the mechanic arrived and after some 15 minutes they gave me my keys back and told me “you’re good to go”. Happily I drove down to the office with new tires on the car.

A little later today I was going to see one of our clients, and I started driving. After a few kilometers I started to hear a banging noise from the back of my car. This had happened to me once before, several years back, and I immediately understood that one of the wheels had not been properly fastened and that I should stop right away. I pulled over and walked around the car – and yes, the bolts were loose – actually, they were almost falling out by themselves! Obviously this was very dangerous, and I called the dealership and told them to come and finish the job on the roadside. They showed up after just a few minutes, and the guy who came over apologized and tightened all the bolts. I suggested to him that they should consider having a second person check the work of the mechanic before handing the vehicle back to the owner. He agreed – and told me it was “the new guy” who’d changed the tires in the morning.

This was down to several factors – a new guy, maybe with insufficient training, and a client putting pressure on him to finish the job (I was inside the dealership drinking coffee and didn’t talk to the mechanic, but he may have interpreted my waiting this way). Simple things affect our performance and can have grave consequences. Why did it not go wrong this time? Because I recognized the sound and stopped in time before the wheel fell off – luck, in addition to experience with the same kind of problem. Humans can thus both be the initiating causes of accidents, and they can be barriers against accidents.

A quest for knowledge – and the usefulness of your HR department in functional safety management

Most firms claim that their people are their most important asset. Whether this has any effect on operations is another matter – some actually mean it, while others do not do much to keep their people well-equipped for the tasks they need to do.

When it comes to functional safety, competence management is a very important part of the game. In many projects, one of the major challenges is getting the right information and documentation from suppliers. Why is this so difficult? It comes down to work processes, communication and knowledge, as discussed in a previous post. One requirement common to IEC 61508 and 61511 is that every role involved in the safety lifecycle should be competent to perform his or her function. In practice, this is only fulfilled if each of these roles has clear competence requirements and expectations, a description of how competence will be assessed, and a plan for how knowledge will be created for those roles.

There are many ways of training your people, and this is a huge part of the field of HR. Most likely, people in your company’s HR functions actually know a great deal about planning, organizing and executing competence development programs. Involving them in your functional safety management planning can thus be a good idea! A few key issues to think about:

  • What are the requirements for your key roles?
  • What are your key roles (package engineer, procurement specialist(!), instrument engineer, project manager, etc., etc.)?
  • How do you check if they have the right competence? (peer assessment, tests, interviews, experience, etc.)?
  • What training resources do you have available? (Courses, e-learning, on-the-job-training, self-study, etc.)?
  • How often do you need to reassess competence?
  • Who is responsible for this system? (HR, project manager, functional safety engineer, etc.)?

A firm that has this in place will most likely be able to steer its supply chain and help suppliers gain confidence and knowledge – vastly improving communication across interfaces and thereby also the quality of cross-organizational work.

How independent should your FSA leader be?

Functional safety assessment is a mandatory 3rd party review/audit for functional safety work, and is required by most reliability standards. In line with good auditing practice, the FSA leader should be independent of the project development. Exactly what does this mean? Practice varies from company to company, from sector to sector and even from project to project. It seems reasonable to require a greater degree of independence for projects where the risks managed through the SIS are more serious. IEC 61511 requires (Clause 5.2.6.1.2) that functional safety assessments are conducted with “at least one senior competent person not involved in the project design team”. In a note to this clause the standard remarks that the planner should consider the independence of the assessment team (among other things). This is hardly conclusive.

If we go to the mother standard IEC 61508, requirements are slightly more explicit, as given by Clause 8.2.15 of IEC 61508-1:2010, which states that the level of independence shall be linked to perceived consequence class and required SILs. For major accident hazards, two categories are used in IEC 61508:

  • Class C: death to several people
  • Class D: very many people killed

For class C the standard accepts the use of an FSA team from an “independent department”, whereas for class D only an “independent organization” is acceptable. Further, also for class C, an independent organization should be used if the degree of complexity is high, the design is novel, or the design organization lacks experience with this particular type of design. There are also requirements based on systematic capability in terms of SIL, but those are normally less stringent in the context of industrial processes than the consequence-based requirements for FSA team independence. The standard also specifies that compliance with sector-specific standards, such as IEC 61511, would make a different basis for consideration of independence acceptable.

In this context, the definitions of “independent department” and “independent organization” are given in Part 4 of the standard. An independent department is separate and distinct from the departments responsible for activities that take place during the specified phase of the overall system or software lifecycle subject to the validation activity. This also means that the line managers of those departments should not be the same person. An independent organization is separated by management and other resources from the organizations responsible for activities that take place during the lifecycle phase. In practice, this means that the organization leading a HAZOP or LOPA should not perform the FSA for the same project if there are potential major accident hazards within the scope, and preferably not if there are any significant fatal accident risks in the project. Considering the requirement of separate management and resource access, it is not a non-conformity if two different legal entities within the same corporate structure perform the different activities, provided they have separate budgets and leadership teams.

If we consider another sector-specific standard, EN 50129 for RAMS management in the European railway sector, we see that similar independence requirements exist for third-party validation activities. Figure 6 in that standard seemingly allows the assessor to be part of the same organization as one involved in SIS development, but for this situation it further requires that the assessor has an authorization from the national safety authority, is completely independent from the project team, and reports directly to the safety authorities. In practice the independent assessor is in most cases from an independent organization.

It is thus highly recommended to have an FSA team from a separate organization for all major SIS developments intended to handle serious risks to personnel; this is in line with common auditing practice in other fields.

Why is this important? Because we are all humans. If we feel ownership of a certain process or product, or affiliation with an organization, it will inevitably be more difficult for us to point out what is not so good. We do not want to hurt people we work with by stating that their work is not good enough – even if we know that inferior quality in a safety instrumented system may actually lead to workers getting killed later. If we look to another field with the same type of challenges but potentially more guidance on independence, we can refer to the Sarbanes-Oxley Act of 2002 from the United States. The SEC has issued guidelines about auditor independence and what should be assessed. Specifically, they include:

  1. Will a relationship with the auditor create a mutual or conflicting interest with their audit client?
  2. Will the relationship place the auditor in the position of auditing his/her own work?
  3. Will the relationship result in the auditor acting as management or an employee of the audit client?
  4. Will the relationship result in the auditor being put in a position where he/she will act as an advocate for the audit client?

It would be prudent to consider at least these questions if considering using an organization that is already involved in the lifecycle phase subject to the FSA.

Electrical isolation of ignition sources on offshore installations

One of the typical major accident scenarios considered when building and operating an offshore drilling or production rig is a gas leak that is ignited, leading to a jet fire or, even worse, an explosion. For this scenario to happen we need three things (from the fire triangle):

  • Flammable material (the gas)
  • Oxygen (air)
  • An ignition source

The primary protection against such accidents is containment of flammable materials; avoiding leaks is the top safety priority offshore. As a lot of this equipment is outdoors (or in “naturally ventilated areas”, as standards tend to call it), it is not an option to remove the air. Removing the ignition source in the event of a gas leak hence becomes very important. The technical system used to achieve this consists of a large number of distributed gas detectors on the installation, sending messages about detected gas to a controller, which then sends a signal to shut down all potential ignition sources (i.e. equipment that is not Ex certified; see the ATEX directive for details).

This being the barrier between “not much happening” and “a major disaster”, the reliability of this ignition source control is very important. Ignition sources are normally electrical systems not designed specifically to avoid ignition (i.e. not Ex certified). In order to have sufficient reliability in this set-up, the number of breakers that must open should be kept to a minimum; this means that the non-Ex equipment should be grouped in distribution boards such that an incomer breaker can be used to isolate the whole group, instead of isolating at the individual consumer level. This is much more reliable, as the probability of failure on demand (PFD) will contain an additive term for each of the breakers included:

PFD = PFD(Detector) + PFD(Logic) + Sum of PFD of each breaker

Consider a situation where you have 100 consumers, and the dangerous undetected failure rate for the breakers used is 10^-7 failures per hour of operation, with proof testing every 24 months. The contribution from a single breaker is

PFD(Breaker) = λ_DU × τ / 2 = 10^-7 × (8760 × 2) / 2 = 0.000876

If we then have 6 breakers that need to open for full isolation, the breakers contribute about 0.005 to the PFD (which means that with reliable gas detectors and logic solver, a full loop can satisfy a SIL 2 requirement). If we have 100 breakers, the contribution to the PFD is almost 0.09 – and the best we can hope for is SIL 1.
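
The same numbers can be reproduced with a few lines of code (a sketch, using the single-channel approximation PFD = λ_DU × τ / 2 from the text above):

    def breaker_pfd(lambda_du: float, test_interval_hours: float) -> float:
        """Average PFD of one breaker: lambda_DU * tau / 2 (single channel,
        dangerous undetected failures, periodic proof testing)."""
        return lambda_du * test_interval_hours / 2

    LAMBDA_DU = 1e-7       # dangerous undetected failures per hour
    TAU = 2 * 8760         # proof test every 24 months, in hours

    single = breaker_pfd(LAMBDA_DU, TAU)
    print(f"one breaker:  {single:.6f}")        # 0.000876
    print(f"6 breakers:   {6 * single:.4f}")    # ~0.005 -> SIL 2 within reach
    print(f"100 breakers: {100 * single:.3f}")  # ~0.088 -> SIL 1 at best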

What is the difference between software and hardware failures in a reliability context?

Reliability engineers have traditionally focused more on hardware than software. There are many reasons for this; one is that safety systems have traditionally been based on analog electronics, and although digital controls and PLCs were introduced throughout the 1990s, the software involved was very simple in the beginning. Today the situation has really changed, but the reliability field has not completely taken this onboard. One of the reasons may be that reliability experts like to calculate probabilities – which they are very good at doing for hardware failures. Hardware failures tend to be random and can be modeled quite well using probabilistic tools. So – what about software? The failure mechanisms are very different: failures in hardware are related to more or less stochastic effects stemming from load cycling, material defects and ageing, whereas software defects are completely deterministic (we disregard stochastic algorithms here – they are banned from use in safety-critical control systems anyway).

Software defects exist for two reasons: design errors (flaws) and implementation errors (bugs). These errors may occur at the requirement stage or during actual coding, but irrespective of when they occur, they are always static. They do not suddenly appear – they are latent errors hidden within the code that will activate each and every time the software state where the error is relevant is visited.
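
A toy example of such a latent, deterministic defect (the function and scenario are invented for illustration):

    def scale_reading(raw: int, span: int) -> float:
        """Convert a raw sensor count to a percentage of span.

        Latent flaw: if span is ever configured as 0, this divides by zero.
        The defect was written in at coding time; it does not strike at
        random, but triggers every single time that state is reached.
        """
        return 100.0 * raw / span

    print(scale_reading(512, 1024))   # fine: 50.0
    try:
        scale_reading(512, 0)         # fails deterministically, every time
    except ZeroDivisionError:
        print("latent defect activated: same state, same failure, always")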

Such errors are very difficult to include in a probabilistic model. That is why reliability standards prescribe a completely different medicine: a process-oriented framework that gives requirements to management, choice of methods and tools, as well as testing and documentation. These quality-directed workflows and requirements are put in place so that we can have some confidence that the software is not a significant source of unsafe failures of the critical control system.

Hence – process verification and auditing take the place of probability calculations when we look at software. In order to achieve the desired level of trust, it is very important that these practices are not neglected in the functional safety work. Deterministic errors may be just as catastrophic as random ones – and therefore they must be managed with just as much rigor and care. The current trend is that more and more functionality is moved from hardware to software – which means that software errors are becoming increasingly important to manage correctly if we are not to degrade both the performance and the trustworthiness of the safety instrumented systems we rely on to protect our lives, assets and the environment.