Do we invest too much in risk assessments and too little in security?

tl;dr: Don’t assess risks before you have basic security controls in place.

I recently came across a LinkedIn post from Karl Stefan Afradi linking to a letter to the editor in the Norwegian version of Computer World, criticizing our tendency to use risk assessments for all types of security decisions. The CW article can be found here: Risikostyring har blitt Keiserens nye klær (“Risk management has become the Emperor’s new clothes”).

The article raises a few interesting and very valid points:

  • Modern regulatory frameworks are often risk based, expecting risk assessments to be used to design security concepts
  • Most organizations don’t have the maturity and competence available to do this in a good way
  • Some security needs are universal, and organizations should get the basic controls right before spending too much time on risk management

I agree that basic security controls should be implemented first. Risk management definitely has its place, but not at the expense of a good basic security posture. The UK NCSC Cyber Essentials scheme is a good place to start for getting the bare-bones controls in place, as I listed here: Sick of Security Theater? Focus on These 5 Basics Before Anything Else. When all of that is in place, it is useful to add more security capabilities. Modern regulatory frameworks such as NIS2, or the Norwegian variant, the Digital Security Act, do include a focus on risk assessment, but they also require other key capabilities, such as a systematic approach to security management, a management system approved by top management, and incident response capabilities: Beyond the firewall – what modern cybersecurity requirements expect (LinkedIn Article).

So, what is a pragmatic approach that will work well for most organizations? I think a 3-step process can help build a strong security posture fit to the digital dependency level and maturity of the organization.

Basic security controls

Start with getting the key controls in place. This will significantly reduce the active attack surface, reduce the blast radius of an actual breach, and allow for easier detection and response. This should be done before anything else.

  • Network security: divide the network into zones, and enforce control of data flows between them. This makes lateral movement harder, and can help shield important systems from exposure to attacks.
  • Patching and hardening: by keeping software up to date and removing features we do not need, we reduce the attack surface.
  • Endpoint security includes the use of anti-virus or EDR software, execution control and script blocking on endpoints. This makes it much harder for attackers to gain a foothold without being noticed, and to execute actions on compromised endpoints such as privilege escalation, data exfiltration or lateral movement techniques.
  • Access control is critical. Only people with a business need for access to data and IT systems should have access. Administrative privileges should be strictly controlled. Least privilege is a critical defense.
  • Asset management is the basis for protecting your digital estate: know what you have and what you have running on each endpoint. This way you know what to check if a critical vulnerability is found, and can also respond faster if a security incident is detected.
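As an illustration of the asset management point above, here is a minimal sketch in Python of how a software inventory could be compared against an approved baseline to flag unknown software. The file name, data format and baseline are assumptions for the example, not a reference to any specific tool.

```python
import json

def load_inventory(path):
    """Load installed-software records, e.g. exported by an endpoint agent.
    Expected format: [{"host": "srv01", "name": "openssl", "version": "3.0.13"}, ...]"""
    with open(path) as f:
        return json.load(f)

def find_unapproved(inventory, approved_names):
    """Return inventory entries whose package name is not on the approved list."""
    return [item for item in inventory if item["name"] not in approved_names]

if __name__ == "__main__":
    # Hypothetical input file; a real setup would pull from a CMDB or endpoint agent.
    inventory = load_inventory("installed_software.json")
    approved = {"openssl", "nginx", "postgresql"}  # illustrative baseline
    for item in find_unapproved(inventory, approved):
        print(f"Unapproved software on {item['host']}: {item['name']} {item['version']}")
```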

Managed capabilities

With the basics in place it is time to get serious about processes, competence and continuous improvement. Clarify who is responsible for what, describe processes for the most important workflows for security, and provide sufficient training. This should include incident response.

By describing and following up security work in a systematic way you start to build maturity and can actually achieve continuous improvement. Think of it in terms of the plan-do-check-act cycle. Make these processes part of corporate governance, and build them out as maturity grows.

Some key procedures you may want to consider include:

  • Information security policy (overall goals, ownership)
  • Risk assessment procedure (methodology, when it should be done, how it should be documented)
  • Asset management
  • Access control
  • Backup management
  • End user security policy
  • Incident response plan
  • Handling of security deviations
  • Security standard and requirements for suppliers

Risk-based enhancements

After step 2 you have a solid security practice in place in the organization, including a way to perform security risk assessments. Performing good security risk assessments requires a good understanding of the threat landscape, the internal systems and security posture, and how technology and information systems support business processes.

The first step to reduce the risk to the organization’s core processes from security incidents is to know what those core processes are. Mapping out key processes and how technology supports them is therefore an important step. A practical approach to describe this on a high level is to use SIPOC – a table format for describing a business process in terms of Suppliers – Inputs – Process – Outputs – Customers. Here’s a good explanation from software vendor Asana.

When this is done, key technical and data dependencies are included in the “Inputs” column. Key suppliers should also include cloud and software vendors. This way we map out the key technical components required to operate a core process. From here we can start to assess the risk to this process from security incidents.

  • (Threats): Who are the expected threat actors and what are their expected modes of operation in terms of operational goals, tradecraft, etc. Frameworks such as MITRE ATT&CK can help create a threat actor map.
  • (Assets and Vulnerabilities): Describe the data flows and assets supporting the process. Use this to assess potential vulnerabilities related to the use and management of the system, as well as the purely technical risks. This can include CVEs, but typically social engineering risks, logic flaws, supply-chain compromise and other less technical vulnerabilities are more important.

We need to evaluate the risk to the business process from the threats, vulnerabilities and assets at risk. One way to do this is to define “expected scenarios” and assess both the likelihood (low, medium, high) and the consequences to the business process of each scenario. Based on this we can define new security controls to further reduce the risk beyond the contribution from the basic security controls.
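To make this concrete, here is a minimal sketch in Python of how a SIPOC-style process description and a handful of expected scenarios could be captured and scored on the low/medium/high scales described above. The process, suppliers and scenarios are made-up examples, not a recommended set.

```python
from dataclasses import dataclass

LEVELS = {"low": 1, "medium": 2, "high": 3}

@dataclass
class ProcessMap:
    """SIPOC-style description of one core business process."""
    name: str
    suppliers: list          # including cloud and software vendors
    inputs: list             # key technical and data dependencies
    outputs: list
    customers: list

@dataclass
class Scenario:
    description: str
    likelihood: str          # "low" | "medium" | "high"
    consequence: str         # "low" | "medium" | "high"

def risk_score(s: Scenario) -> int:
    """Simple ordinal likelihood x consequence score (1..9)."""
    return LEVELS[s.likelihood] * LEVELS[s.consequence]

# Illustrative example process and scenarios (all names and values are made up).
order_to_cash = ProcessMap(
    name="Order to cash",
    suppliers=["ERP vendor (SaaS)", "Payment provider"],
    inputs=["ERP database", "Customer master data", "Payment API"],
    outputs=["Invoices", "Shipping orders"],
    customers=["End customers", "Finance department"],
)

scenarios = [
    Scenario("Ransomware encrypts the ERP database", "medium", "high"),
    Scenario("Phished finance user approves a fraudulent payment", "high", "medium"),
    Scenario("Outage at the payment provider", "medium", "medium"),
]

for s in sorted(scenarios, key=risk_score, reverse=True):
    print(f"{risk_score(s)}  {s.description} ({s.likelihood}/{s.consequence})")
```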

Note that the risk treatment we design based on the risk assessment can include more than just technical controls. It can be alternative processes to reduce the impact of a breach, a reduced financial burden through insurance policies, well-prepared incident response procedures, good communication with suppliers and customers, and so on. The key benefit of the risk assessment is in improving business resilience, not selecting which technical controls to use.

Do we invest too much in risk assessments then?

Many organizations don’t do risk assessments. That is a problem, but what makes it worse is that immature organizations also fail at the previous steps here. They don’t implement basic security controls. They also don’t have clear roles and responsibilities, or procedures for managing security. For those organizations, investing in risk management should not be the top priority; getting the basics right should be.

For more mature organizations, the basics may be in place, but the understanding of how security posture weaknesses translate to business risk may be weak or non-existent. Those businesses would benefit from investing more in good quality risk assessment. It is also a good vaccination against Shiny Object Syndrome – Security Edition (“we need a new firewall and XDR and DLP and this and that and a next-gen dark-AI blockchain-driven anomaly-based network immune system”).

Teaching smart things cyber self defense: ships and cars that fight back

We have connected everything to the Internet – from power plants to washing machines, from watches with fitness trackers to self-driving cars and even self-driving gigantic ships. At the same time, we struggle to defend our IT systems from criminals and spies. Every day we read about data breaches and cyber attacks. Why are we then not more afraid of cyber attacks on the physical things we have put into our networks?

  • Autonomous cars getting hacked – what if they crash into someone?
  • Autonomous ships getting hacked – what if they lose stability and sink due to a cyber attack or run over a small fishing boat?
  • Autonomous light rail systems – what if they are derailed at high speed due to a cyber attack?

Luckily, we are not reading about things like this in the news, at least not very often. There have been some car hacks mentioned, usually demonstrations of what is possible. But when we build more and more of these smart systems that can cause disasters if control is lost, shouldn’t we think more about security when we build and operate them? Perhaps you think that someone must surely be taking care of that. But the fact is, in many cases, it isn’t really handled very well.

How can an autonomous vessel defend against cyber attacks?

What is the attack surface of an autonomous system?

The attack surface of an autonomous system may of course vary, but they tend to have some things in common:

  • They have sensors and actuators communicating with a “brain” to make decisions about the environment they operate in
  • They have some form of remote access and support from a (mostly) human operated control center
  • They require systems, software and parts from a high number of vendors with varying degree of security maturity

For the sake of the example, consider an autonomous battery-powered vessel at sea, such as a ferry. Such a vessel will have multiple operating modes:

  • Docking to the quay
  • Undocking from the quay
  • Loading and offloading at quay
  • Journey at sea
  • Autonomous safety maneuvers (collision avoidance)
  • Autonomous support systems (bilge, ballast, etc)

In addition there will typically be a number of operations that are to some degree human-led, such as search and rescue if there is a man overboard situation, firefighting, and other operations, depending on the operating concept.

To support the operations required in the different modes, the vessel will need an autonomous bridge system, an engine room able to operate without a human engineer in place to maintain propulsion, and various support systems for charging, mooring, cargo handling, etc. This will require a number of IT components in place:

  • Redundant connectivity with sufficient bandwidth (5G, satellite)
  • Local networking
  • Servers to run the required software for the ship to operate
  • Sensors to ensure the ship’s autonomous system has good situational awareness (and the human onshore operators in the support center)

The attack surface is likely to be quite large, including a number of suppliers, remote access systems, people and systems in the remote control center, and remote software services that may run in private data centers, or in the cloud. The main keyword here is: complexity.

Defending against cyber piracy at high seas

With normal operation of the vessel, its propulsion and bridge systems would not depend on external connectivity. Although cyber attacks can also hit conventional vessels, much of the damage can be counteracted by seafarers onboard taking more manual control of the systems and turning off much of the “smartness”. With autonomous systems this is not always an option, although there are degrees of autonomy and it is possible to use similar mechanisms if the systems are semi-autonomous with people present to take over in case of unusual situations. Let’s assume the systems are fully autonomous and there is nobody onboard to take control of them.

Since there are no people to compensate for digital systems working against us, we need to teach the digital systems to defend themselves. We can apply the same structured approach to securing autonomous systems as we do to other IT and OT systems, but we cannot rely on risk reduction from local human intervention. If we follow “NSM’s Grunnprinsipper for IKT-sikkerhet” (the Norwegian government’s recommendations for IT security, very similar to NIST’s cybersecurity framework), we have the following key phases:

  1. Identify: know what you have and the security posture of your system
  2. Protect: harden your systems and use security technologies to stop attackers
  3. Detect: set up systems so that cyber attacks can be detected
  4. Respond: respond to contain compromised systems, evict intruders, recover capabilities, improve hardening and return to normal operations

These systems are also operational technology (OT). It may therefore be useful to refer to IEC 62443 in the analysis of the systems, especially to assess the risk to the system, assign required security levels and define requirements. The IEC 62443 reference architecture is also useful.

Not all security systems have to work completely autonomously on an autonomous vessel, but security has to be more automated than in a normal OT system, and also than in most IT systems. Let’s consider what that could mean for a collision avoidance system on an autonomous vessel. The job of the collision avoidance system can be defined as follows:

  1. Detect other vessels and other objects that we are on collision course with
  2. Detect other objects close-by
  3. Choose action to take (turn, stop, reverse, alert on-shore control center, communicate to other vessels over radio, etc)
  4. Execute action
  5. Evaluate effect and make corrections if needed

In order to do this, the ship has a number of sensors to provide the necessary situational awareness. There has been a lot of research into such systems, especially collaborative systems with information exchange between vessels. There have also been pilot developments, such as this one https://www.maritimerobotics.com/news/seasight-situational-awareness-and-collision-avoidance by the Norwegian firm Maritime Robotics.

Let us consider a simplified view of how the collision avoidance system works. Sensors tell the anti-collision server what they see. The traffic is transmitted over proprietary protocols, some over TCP, some over UDP (camera feeds). Some of the traffic is not encrypted, but all of it is transferred over the local network. The main system server processes the data onboard the ship and makes decisions. Those decisions go to functions in the autonomous bridge that take action, including sending radio messages to nearby ships or onshore. Data is also transmitted to onshore control via the bridge system. Onshore operators can use a remote connection to connect to the collision avoidance server directly for richer data, as well as to override or configure the system.
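As a simplified illustration of the sensor-to-server flow described above, the sketch below shows a server receiving unauthenticated sensor reports over UDP. The port, message format and field names are assumptions made for the example, not the actual protocols used on any real vessel.

```python
import json
import socket

SENSOR_PORT = 4750  # illustrative; real systems use proprietary protocols

def handle_message(raw: bytes) -> None:
    """Parse one sensor report and hand it to the decision logic.
    Unencrypted, unauthenticated traffic like this is exactly why the local
    network segment has to be tightly controlled."""
    try:
        report = json.loads(raw)
    except ValueError:
        return  # drop malformed input instead of crashing the server
    if {"sensor_id", "bearing_deg", "range_m"} <= report.keys():
        print(f"Target at {report['bearing_deg']} deg, {report['range_m']} m "
              f"from sensor {report['sensor_id']}")

def serve() -> None:
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("0.0.0.0", SENSOR_PORT))
        while True:
            data, _addr = sock.recvfrom(65535)
            handle_message(data)

if __name__ == "__main__":
    serve()
```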

Identify

The system should automatically create a complete inventory of its hardware, software, networks, and users. This inventory must be available for automated decision making about security but also for human and AI agents working as security operators from onshore.

The system should also automatically keep track of all temporary exceptions and changes, as well as any known vulnerabilities in the system.

In other words: a real-time security posture management system must be put in place.
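A minimal sketch, assuming a Linux-based collision avoidance server, of the kind of machine-readable inventory snapshot such a posture management system could collect. The fields and commands are illustrative; a real system would cover far more (installed packages, users, firmware versions, temporary exceptions, known vulnerabilities).

```python
import json
import platform
import socket
import subprocess
from datetime import datetime, timezone

def snapshot() -> dict:
    """Collect a minimal inventory record for this host (Linux assumed)."""
    return {
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "hostname": socket.gethostname(),
        "os": platform.platform(),
        # Listening sockets from iproute2's `ss`; illustrative of the kind of
        # live posture data that should be tracked continuously.
        "listening_sockets": subprocess.run(
            ["ss", "-tlnH"], capture_output=True, text=True
        ).stdout.splitlines(),
    }

if __name__ == "__main__":
    # In a real deployment this record would be pushed to the onshore
    # security posture database at a fixed interval.
    print(json.dumps(snapshot(), indent=2))
```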

Protect

An attacker may wish to perform different types of actions on this vessel. Since we are only looking at the collision avoidance system here we only consider an adversary that wants to cause an accident. Using a kill-chain approach to our analysis, the threat actor thus has the following tasks to complete:

  • Recon: get an overview of the attack surface
  • Weaponization: create or obtain payloads suitable for the target system
  • Delivery: deliver the payloads to the systems. Here the adversary may find weaknesses in remote access, perform a supply-chain attack to deliver a flawed update, use an insider to gain access, or compromise an on-shore operator with remote access privileges.
  • Execution: if a technical attack, automated execution will be necessary. For human based attacks, operator executed commands will likely be the way to perform malware execution.
  • Installation: valid accounts on systems, malware running on Windows server
  • Command and control: use internet connection to remotely control the system using installed malware
  • Actions on objectives: reconfigure sensors or collision avoidance system by changing parameters, uploading changed software versions, or turning the system off

If we want to protect against this, we should harden our systems as much as possible.

  • All access should require MFA
  • Segregate networks as much as possible
  • Use least privilege as far as possible (run software as non-privileged users)
  • Write-protect all sensors
  • Run up-to-date security technologies that block known malware (firewalls, antivirus, etc)
  • Run only pre-approved and signed code, block everything else (see the sketch below)
  • Remove all unused software from all systems, and disable built-in functionality that is not needed
  • Block all non-approved protocols and links on the firewall
  • Block internet egress from endpoints, and only make exceptions for what is needed

Doing this will make it very hard to compromise the system using regular malware, unless operations are run as an administrator who can change the hardening rules. It will most likely protect against most malware run as an administrator too, if the threat actor has not anticipated the hardening steps. Blocking traffic both on the main firewall and on host-based firewalls makes it unlikely that the threat actor will be able to remove both security controls.
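As a toy illustration of the “run only pre-approved and signed code” point, a hash-based allowlist check could look like the sketch below. The directory path and hash are placeholders; real deployments would use the platform’s application control mechanisms (for example signed-code policies) rather than a script like this.

```python
import hashlib
from pathlib import Path

# SHA-256 hashes of approved executables (placeholder values).
APPROVED_HASHES = {
    "3f79bb7b435b05321651daefd374cdc681dc06faa65e374e38337b88ca046dea",
}

def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def is_approved(path: Path) -> bool:
    """Allow execution only for binaries whose hash is on the approved list."""
    return sha256_of(path) in APPROVED_HASHES

if __name__ == "__main__":
    # Hypothetical install directory for the collision avoidance software.
    for exe in Path("/opt/collision-avoidance/bin").glob("*"):
        status = "OK" if is_approved(exe) else "BLOCKED"
        print(f"{status}  {exe}")
```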

Detect

If an attacker manages to break into the anti-collision system on our vessel, we need to be able to detect this fast, and to respond to it. The autonomous system should ideally perform the detection on its own, without the need for a human analyst, due to the need for fast response. Using humans (or AI agents) onshore in addition is also a good idea. As a minimum, the system should:

  • Log all access requests and authorization requests
  • Apply UEBA (user and entity behavior analytics) to detect unusual activity
  • Use advanced detection technologies such as security features of a NGFW, a SIEM with robust detection rules, thorough audit logging on all network equipment and endpoints
  • Use EDR technology to provide improved endpoint visibility
  • Receive and use threat intelligence in relevant technologies
  • Use deep packet inspection systems with protocol interpreters for any OT systems part of the anti-collision system
  • Map threat models to detection coverage to ensure anticipated attacks are detectable
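As a sketch of the last point above, the mapping from anticipated ATT&CK techniques to existing detection rules can be kept as data and checked automatically for gaps. The technique selection and rule names below are illustrative assumptions, not a recommended coverage set.

```python
# Expected techniques from the threat model (MITRE ATT&CK IDs, illustrative).
threat_model = {
    "T1133": "External Remote Services",
    "T1078": "Valid Accounts",
    "T1195": "Supply Chain Compromise",
    "T1562": "Impair Defenses",
}

# Which techniques our current detection rules claim to cover (illustrative).
detection_rules = {
    "vpn-anomalous-login": ["T1133", "T1078"],
    "edr-tamper-alert": ["T1562"],
}

covered = {t for techniques in detection_rules.values() for t in techniques}
gaps = {tid: name for tid, name in threat_model.items() if tid not in covered}

for tid, name in gaps.items():
    print(f"No detection coverage for {tid} ({name})")
```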

By using a comprehensive detection approach to cyber events, combined with a well-hardened system, it will be very difficult for a threat actor to take control of the system unnoticed.

Respond and recover

If an attack is detected, it should be dealt with before it can cause any damage. It may be a good idea to conservatively plan for a physical response when a cybersecurity intrusion is detected on an autonomous ship, even if the detection is not 100% reliable, especially for a safety-critical system. A possible response could be:

  • Isolate the collision avoidance system from the local network automatically
  • Stop the vessel and maintain position (using dynamic positioning if it is available and unaffected by the security detections, with dropping anchor as a backup)
  • Alert nearby ships over radio that “Autonomous ship has lost anti-collision system availability and is maintaining position. Please keep distance. “
  • Alert onshore control of the situation.
  • Run system recovery

System recovery could entail securing forensic data, automatically analysing the data for indicators of compromise to identify patient zero and the exploitation path, expanding the analysis to systems reached through pivots, reinstalling all affected systems from trusted backups, updating configurations and hardening against the exploitation path if possible, performing system validation, and transferring back to operations with approval from onshore operations. Establishing a response system like this would require considerable engineering effort.
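A minimal sketch of how such a response playbook could be expressed as ordered, automatable steps with an explicit onshore approval gate before returning to normal operation. The step names and functions are illustrative placeholders for the actual integrations.

```python
from typing import Callable

def isolate_collision_avoidance() -> None:
    print("Isolating collision avoidance system from the local network")

def stop_and_hold_position() -> None:
    print("Stopping vessel and holding position (DP if healthy, anchor as backup)")

def alert_nearby_ships() -> None:
    print("Broadcasting radio warning: anti-collision system unavailable, keep distance")

def alert_onshore_control() -> None:
    print("Alerting onshore control center")

def run_system_recovery() -> None:
    print("Securing forensics, reinstalling affected systems from trusted backups")

PLAYBOOK = [
    isolate_collision_avoidance,
    stop_and_hold_position,
    alert_nearby_ships,
    alert_onshore_control,
    run_system_recovery,
]

def execute_playbook(onshore_approval: Callable[[], bool]) -> None:
    for step in PLAYBOOK:
        step()
    # Returning to normal operation requires explicit onshore approval.
    if onshore_approval():
        print("Approval received: returning to normal operations")
    else:
        print("Holding position until manual recovery and approval")

if __name__ == "__main__":
    execute_playbook(onshore_approval=lambda: False)
```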

An alternative approach is to maintain position, and wait for humans to manually recover the system and approve returning to normal operation.

The development of autonomous ships, cars and other high-risk applications is subject to regulatory approval. Yet, the focus of authorities may not be on cybersecurity, and the competence of those designing the systems, as well as of those approving them, may be stronger in other areas than cyber. This is especially true for sectors where cybersecurity has not traditionally been a big priority due to more manual operations.

A cyber risk recipe for people developing autonomous cyber-physical systems

If we are going to make a recipe for development of responsible autonomous systems, we can summarize this in 5 main steps:

  • Maintain good cyber situational awareness. Know what you have in your systems, how it works, and where you are vulnerable – and also keep track of the adversary’s intentions and capabilities. Use this to plan your system designs and operations. Adapt as the situation changes.
  • Rely on good practice. Use IEC 62443 and other known IT/OT security practices to guide both design and operation.
  • Involve the suppliers and collaborate on defending the systems, from design to operations. We only win through joint efforts.
  • Test continuously. Test your assumptions, your systems, your attack surface. Update defenses and capabilities accordingly.
  • Consider changing operating mode based on threat level. With good situational awareness you can take precautions when the threat level is high by reducing connectivity to a minimum, moving to lower degree of autonomy, etc. Plan for high-threat situations, and you will be better equipped to meet challenging times.

Extending the risk assessment mind map for information security

This post is based on the excellent mindmap posted on taosecurity.blogspot.com – detailing the different fields of cybersecurity. The author (Richard) said he was not really comfortable with the risk assessment portion. I have tried to change the presentation of that portion – into the more standard thinking about risk stemming from ISO 31000 rather than security tradition.

Red team and blue team activities are presented under penetration testing in the original mind map. I agree that the presentation there is a bit off – red team is about pentesting, whereas blue team is the defensive side. In normal risk management lingo, these terms aren’t that common – which is why I left them out of the mind map for risk assessment. For an excellent discussion of these terms, see this post by Daniel Miessler: https://danielmiessler.com/study/red-blue-purple-teams/#gs.aVhyZis.

Suggested presentation of the risk assessment mind map – wrapping it in typical risk assessment activity descriptions

The map shown here breaks down the risk assessment process into the following containers:

  • Context description
  • Risk identification
  • Risk analysis
  • Treatment planning

There are of course many links between other security-related activities and risk assessments. Risk monitoring and communication processes connect these dots.

Threat intelligence is also essential for understanding the context – which in turn dictates the attack scenarios and the credibility needed to prioritize risks. Threat intelligence entails many activities, as indicated by the original mind map. One source of intel from ops that is missing from that map, by the way, is threat hunting. That also ties into risk identification.

I have also singled out security ops as it is essential for risk monitoring. This is required on the tactical level to evaluate whether risk treatments are effective.

Further, “scorecards” has been used as a name for strategic management here – integration into strategic management and governance is necessary to ensure effective risk management and to involve the right parts of the organization.

IEC 61511 Security – getting the right detail level

When performing the risk and vulnerability assessment required by the new IEC 61511 standard, make sure the level of detail is just right for your application. Normally the system integrator operates at the architectural level, meaning signal validation in software components should probably already have been dealt with. On the other hand, upgrading and maintaining the system during the entire lifecycle has to be looked into. Just enough detail can be hard to aim for: digging too deep is costly, and being too shallow doesn’t help your decision making. Therefore, planning the depth of the security assessment from the beginning should be a priority!

Starting with the context – having the end in mind

The purpose of including cybersecurity requirements in a safety instrumented system design is to make sure the reliability of the system is not threatened by security incidents. That reliability requires each safety instrumented function (SIF) to perform its intended task at the right moment; we are concerned with the availability and the integrity of the system.

 

The probability of failure on demand for a safety critical function usually depends on random error distributions and testing regimes. How can hacker threats be included in the thinking around reliability engineering? The goal is to remain confident in the reliability calculations, so that quantitative risk calculations are still meaningful.

 

In order to understand the threats to your system you need to start with the company and its place in the world, and in the supply chain. What does the company do? Consider an oil producer active in a global upstream market – producing offshore, onshore, as well as from unconventional sources such as tar sands, arctic fields and shale oil. The company is also investing heavily in Iraq, including areas recently captured from ISIS. Furthermore, on the owner side of this company you find a Russian oligarch, known to be close to the Kremlin, as majority stockholder. The firm is listed on the Hong Kong stock exchange. Its key suppliers are Chinese engineering firms and steel producers, and its top customers are also Chinese government-backed companies. How does all of this affect the threat landscape as it applies to this firm?

The firm is involved in activities that may trigger the interest of hacktivists:

  • Unconventional oil production
  • Arctic oil production

It also operates in an area that can make it a target for terrorist groups, in one of the most politically unstable regions in the world, where the world’s largest military powers also have to some degree opposing interests. This could potentially draw the interest of both terrorist groups and nation-state hackers. It is also worth noting that the company is on good terms with both the Russian and Chinese governments, two countries often accused of using state-sponsored hackers to target companies in the west. The largest nation-state threat to this oil company may thus come from western countries, including the one headed by Donald Trump. He has been quite silent on cybersecurity after taking office but issued statements during his campaign in 2016 hinting at more aggressive build-ups of offensive capacities. So, the company should at least expect the interest of script kiddies, hacktivists, cybercriminals, terrorists, nation states and insiders. These groups have quite varying capacities, and the SIS is typically hard to get at due to multiple firewalls and network segregation. Our main focus should thus be on hacktivists, terrorists and nation states – with cybercriminals and insiders acting as proxies (knowingly or not).

The end in mind: keeping safety-critical systems reliable also under attack, or at least making attacks an insignificant contribution to unreliability.

Granularity of security assessment

The goal of this discussion was to find the right depth level for risk and vulnerability assessments under IEC 61511. If we start with the threat actors and their capabilities, we observe some interesting issues:

  • Nation states: capable of injecting unknown features into firmware and application software at the production stage, including human infiltration of engineering teams. This can also be “features” sanctioned by the producer in some countries. Actual operations can include cyberphysical incursions with real asset destruction.
  • Terrorists: infiltration of vendors less likely. Typical capabilities are APTs using phishing to breach the attack surface, and availability attacks through DDoS provided the SIS can be reached. Physical attack is also highly likely.
  • Cybercriminals: similar to terrorists, but may also have more advanced capabilities. Can also act out of own interest, e.g. through extortion schemes.
  • Hacktivists: unlikely to threaten firmware and software integrity. Not likely to desire asset damage as that can easily lead to pollution, which is in conflict with their likely motivations. DDoS attacks can be expected, SIS usually not exposed.

Some of these actors have serious capabilities, and it is possible that they will be used if the political climate warrants it. As we are most likely relying on procured systems from established vendors, using limited variability languages for the SIS, we have little influence over the low-level software engineering. Configurations, choice of blocks and any inclusion of custom-designed software blocks are another story. Our assessment should thus, at least, include the following aspects:

  • Procurement – setting security requirements and general information security requirements, and managing the follow-up process and cross-organizational competence management.
  • Software components – criticality assessment. Extra testing requirements to vendors. Risk assessment including configuration items.
  • Architectural security – network segregation, attack surface exposure, monitoring, security technologies, responsible organizations and network operations
  • Hardware – tampering risk, exposure to physical attacks, ports and access points, network access points including wireless (VSAT, microwave, GSM, WiFi)
  • Organizational security risks: project organization, operations organization. Review of roles and responsibilities, criticality of key personnel, workload aspects, contractual interfaces, third-party personnel.

Summary

This post does not give a general procedure for deciding the depth of analysis, but it does outline important factors. Always start with the context to judge both impact and expected actions from threat actors. Use this to determine the capabilities of the main threat actors. This will help you decide the granularity level of your assessment. The things that are outside of your control should not be neglected either, but considered a point of uncertainty that may influence the security controls you need to put in place.

 

A sketch of key factors to include when deciding on the granularity for a cybersecurity risk assessment under IEC 61511

 

 

 

Users: from threats to security enhancers?

Security should be an organization-wide effort. That means getting everyone to play the same game, requiring IT to stop thinking about “users as internal threats”, and to start instead thinking about “internal customers as security enhancers”. This can only be achieved by using balanced security measures, involving the internal customers (users) through sharing the risk picture, and putting risk-based thinking behind security planning to drive rational and balanced decisions. For many organizations with a pure compliance focus this can be a challenging journey – but the reward at the end is an organization better equipped to tackle a dynamic threat landscape.

Users have traditionally been seen as a major threat in information security settings. The insider threat is a very real thing but this does not mean that the user is the threat as such. There has recently been much discussion about how we can achieve a higher degree of cybersecurity maturity in organizations, and whether cybersecurity awareness training really works. This post does not give you the answers but describes some downsides to the compliance oriented tradition. The challenge is to find a good balance between controls and compliance on one side, and driving a positive security culture on the other.

Locking down users’ tools too much may enhance security on paper – until you consider the effect of trust erosion on the human attack surface. When knowledge workers feel they are not trusted by their organization, they also feel undervalued, something that can create hostility towards management in general, and cybersecurity policies specifically. There is no fun working in an environment where all the toys are locked down.

Most accidents involve “human error” as part of the accident chain, pretty much like most security breaches also involve some form of human error, typically a user failing to spot a social engineering attempt where the security technology is also inept at making good protection decisions. Email is still the most common malware delivery method, and phishing would not work without humans on the other end. This picture is what your security department is used to seeing: the user performs some action that allows the attacker to penetrate the organization. Hence, the user is a threat. The cure for this is supposed to be cybersecurity awareness training teaching users not to open attachments from sketchy sources, not to click those links, not to use weak passwords and so on. The problem is just that this only partially works. Some people have even gone so far as to say that it is completely useless.

The other part of the story is the user who reports that his or her computer is misbehaving, or that some resources have become unavailable, or forwards spear-phishing attempts. Those users are complying with policy and allowing the organization to spot potential attempts at recon or attack before the fact, or at least relatively soon after a breach. These users are security enhancers – the very thing security awareness training is trying to create by at least making users a little bit less dangerous.

Because people do risky things when possible, the typical IT department answer to the insider threat is to lock down every workstation as much as possible, to “harden it”, i.e. making the attack surface smaller. This attack surface view, however, only considers the technology, not the social component. If you lock down the systems more than the users feel is necessary, they will probably start opposing company policies. They will not be reporting suspicious activities as often anymore. They will go through the motions of your awareness training, but little behavioral change is seen afterwards. You risk that shadow IT starts to take hold of your business – that employees use their private cloud accounts, portable apps or private computers to do their jobs – because the tools they feel they need to do their jobs are locked down, made inflexible or simply unavailable by the IT department in order to “reduce the attack surface”. So, not only do you risk priming your employees for social engineering attacks (angry employees are easier to manipulate) and making your staff less able to benefit from your training courses, but you may also be significantly increasing the technical attack surface through shadow IT.

So what is the solution – to allow users to do whatever they want on the network, give them admin rights and no controls? Obviously a bad idea. The keywords are balanced measures, involvement and risk-based thinking.

  • Balanced: there must be a balance between security and productivity. A full lockdown may be required for information objects of high value to the firm and with credible attack scenarios, but not every piece of data and every operation is in that category.
  • Involvement: people need to understand why security measures are in place to make sense of the measures. Most security measures are impractical to people just wanting to get the job done. Understanding the implications of a breach and the cost-benefit ratio of the measures in place greatly helps people motivate themselves to do what feels slightly impractical.
  • Risk based thinking: measures must be adequate to the risk posed to the organization and not exaggerated. The risk picture must be shared with the employees as part of the security communication – this is a core leadership responsibility and the foundation of security aware cultures.

In the end it comes down to respect. Respect other people for what they do, and what value they bring to the organization. Think of them as customers instead of users. Only drug dealers and IT departments refer to their customers as users (quoted from somewhere forgotten on the internet).

Do SCADA vulnerabilities matter?

Sometimes we talk to people who are responsible for operating distributed control systems. These are sometimes linked up to remote access solutions for a variety of reasons. Still, the same people often do not understand that vulnerabilities are still being found in mature systems, and they often fail to take the typically simple actions needed to safeguard their systems.

For example, a new vulnerability was recently discovered for the Siemens Simatic CP 343-1 family. Siemens has published a description of the vulnerability, together with a firmware update to fix the problem: see Siemens.com for details.

So, are there any CP 343s facing the internet? A quick trip to Shodan shows that, yes, indeed, there are lots of them. Everywhere, more or less.

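For readers who want to repeat this kind of check, a minimal sketch using the official Shodan Python library could look like the following. The API key is a placeholder, and the search query is only an example of how one might look for exposed Siemens equipment, not a verified fingerprint for the CP 343-1.

```python
import shodan

API_KEY = "YOUR_API_KEY"  # placeholder; requires a Shodan account

api = shodan.Shodan(API_KEY)
try:
    # Illustrative query: port 102 is used by the S7 communication protocol.
    results = api.search("Siemens port:102")
    print(f"Exposed devices found: {results['total']}")
    for match in results["matches"][:10]:
        print(match["ip_str"], match.get("org", "unknown org"))
except shodan.APIError as e:
    print(f"Shodan error: {e}")
```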

Now, if you did have a look at the Siemens site, you will see that the patch was available from the release date of the vulnerability, 27 November 2015. What, then, is the average update time for patches in a control system environment? There are no patch Tuesdays. In practice, such systems are patched somewhere between monthly and never, with a bias towards never. That means that the bad guys have lots of opportunities for exploiting your systems before a patch is deployed.

This simple example reinforces that we should stick to the basics:

  • Know the threat landscape and your barriers
  • Use architectures that protect your vulnerable systems
  • Do not use remote access where it is not needed
  • Reward good security behaviors and sanction bad attitudes with employees
  • Create a risk mitigation plan based on the threat landscape, and stick to it in practice too

 

SIL and ballast systems

Working on floating oil and gas facilities, one question about ballast systems keeps popping up: should they have SIL requirements, and if so, what should the requirements be? When seeking to establish requirements for such systems, several issues are uncovered. First of all, current designs of ballast systems are very robust due to the evolution of designs and requirements in shipping over a long time. Further, the problem is much more complex than collecting a few well-defined failure modes with random error data leading to a given situation, as typically seen in many process industry type problem descriptions. This complexity depends on a number of factors, and some of them are specific to each ship or installation, such as location, ship traffic density or the operating practices of personnel onboard. Therefore, any quantitative estimates of “error probabilities” contributing to an expected return frequency of critical events concerning the system will have significant uncertainties associated with them.


A ballast system is used to maintain the stability of a ship or a floating hull structure under varying cargo loading conditions and in various sea conditions and ship drafts. Water is kept in tanks dispersed around the hull structure, and can be pumped in or out, or transferred between tanks, to maintain stability. Errors in ballasting operations can lead to loss of stability, which in the worst case means a sunken ship. The ballasting operation is normally a semi-manual operation where a marine operator would use a loading computer to guide decisions about ballasting, and manually give commands to a computer-based control system on where to transfer water into or out of a particular ballast tank. Because this is such a critical safety system, it is natural to ask: “what are the performance requirements?”.

Ballast systems have been part of shipping for hundreds of years. Requirements for ballast systems are thus set in the classification rules of ship classification societies, such as Lloyd’s Register, DNV GL or ABS. These requirements are typically prescriptive in nature and focus on robustness and avoidance of common cause failures in the technology. Maritime classification societies do not refer to safety integrity levels but rely on other means of ensuring safe operation and reliability. Society has accepted this practice for years, for very diverse vessels ranging from oil tankers to passenger cruise ships.

In oil and gas operations, the use of safety integrity levels to establish performance requirements for instrumented safety functions is the norm, and standards such as IEC 61508 are used as the point of reference. The Norwegian Oil and Gas Association has made a guideline that is normally applied for installations in Norwegian waters, which offers a simplification of requirements setting based on “typical performance”. This guideline can be freely downloaded from this web page. The guideline states that for “start of ballasting for rig re-establishment”, the system should conform to a SIL 1 requirement. The “system” is described as consisting of a ballast control node, 2 x 100% pumps and three ballast valves. In appendix A.12 of the guideline a description of this “sub-function” is given with a calculation of achievable performance.

It may be argued that this functional description is somewhat artificial, because the ballast system on a production installation is normally operated more or less continuously. The function is defined for a single ballast tank/compartment, irrespective of the number of tanks and the necessary load balancing for re-establishing stability. The Guideline 070 approach is based on “typical performance” of the safety system as it is defined, and is not linked directly to the required risk reduction provided by the system.

Multiple approaches may be taken to assign safety integrity levels based on risk analysis, see for example IEC 61508. One such method that is particularly common in the process industries and the oil and gas industry is “layers of protection analysis”, or LOPA for short. In this type of study, multiple initiating events can contribute to one hazard situation, for example “sunken ship due to loss of stability”. Multiple barriers or “independent protection layers” can be credited for reducing the risk of this hazard being realized. In order to use a risk-based method for setting the integrity requirement, it is necessary to define what is an acceptable frequency of this event occurring. Let us say for the sake of the discussion that it is acceptable that the mean time between each “sunken ship due to loss of stability” is 1 million years.

How can we reason about this to establish requirements for the ballast system? The functional requirement is that we should “be able to shift ballast loading to re-establish stability before the condition becomes unrecoverable”. In order to start analyzing this situation, we need to estimate how often we will have a condition that can lead to such an unrecoverable situation if not correctly managed. Let us consider three such “initiating events”:

  • Loading operator error during routine ballasting (human error)
  • Damage to hull due to external impact
  • Error in load computer calculations

All of these situations depend on a number of factors. The probability that the loading operator will perform an erroneous operation depends on stress levels, competence/training and management factors. A thorough analysis using “human reliability analysis” can be performed, or a more simplified approach may be taken. We may, for example, make the assumption that the average operator makes 1 error without noticing immediately every 100 years (this is an assumption – it must be validated if used).

Damage to the hull due to external impact would depend on the ship traffic density in the area, whether there is a difficult political situation (war, etc.), or whether you are operating in arctic environments where ice impact is likely (think Titanic). Again, you may do extensive analysis to establish such data, or make some assumptions based on expert judgment. For example, we may assume a penetrating ship collision every 100 years on average.

What about errors in load computer calculations? Do the operators trust the load computer blindly, or do they perform sanity checks? How was the load computer programmed? Is the software mature? Is the loading condition unusual? Many questions may be asked here as well. For the sake of this example, let us assume there is no contribution from the loading computer.

We are then looking at an average initiating event frequency of 0.01 per year for human errors and 0.01 per year for hull damage.

Then we should think about what our options for avoiding the accidental scenario are, given that one of the initiating events has already occurred. As “rig re-establishment” depends on the operator performing some action on the ballast system, key to such barriers is making the operator aware of the situation. One natural way to do this would be to install an alarm indicating a dangerous ballast condition, and train the operator to respond. What is the reliability of this as a protection layer? The ballast function itself is what we are trying to set the integrity requirement for, and any response of the operator requires this system to work. Simply notifying the operator is thus necessary but not enough for us. In case the ballast system fails when the operator tries to rectify the situation, the big question is: does the operator have a second option? Such options may be a redundant ballast system, not using the same components, to avoid common cause failure. In most situations the dynamics will be slow enough to permit manual operation of pumps and valves from local control panels. This is a redundant option if the operator is trained for it. If the alarm does not use the same components as the function itself, we have an independent protection layer. The reliability of this, put together with the required response of a well-trained operator, cannot be credited with better than a 90% success rate in a critical situation (ref. IEC 61511, for example).

So, based on this super-simplified analysis, are we achieving our required MTTF of 1 million years?

Events per year: 0.02.

Failure in IPL: Alarm + operator response using local control panels: 0.1.

OK, so we are achieving an MTTF of:

1/(0.02 x 0.1) = 500 years.
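The same back-of-the-envelope numbers in code form, using the assumptions stated above (the frequencies are illustrative assumptions, not measured data):

```python
# Assumed initiating event frequencies (per year), from the discussion above.
operator_error = 0.01       # one unnoticed loading error per 100 operator-years
hull_damage = 0.01          # one penetrating collision per 100 years
initiating_frequency = operator_error + hull_damage   # 0.02 per year

# Independent protection layer: alarm + trained operator using local panels.
ipl_failure_probability = 0.1   # no better than 90% success in a critical situation

event_frequency = initiating_frequency * ipl_failure_probability
mttf_years = 1 / event_frequency

target_mttf_years = 1_000_000
print(f"MTTF achieved: {mttf_years:.0f} years (target: {target_mttf_years})")
print(f"Shortfall factor: {target_mttf_years / mttf_years:.0f}x")
```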

This is pretty far from where we said we should be. First of all, this would require our ballast system to operate with better than SIL 4 performance (which is completely unrealistic), and furthermore, it includes the same operator again performing manual actions. Of course, considering how many ships are floating at sea and how few of them are sinking, this is probably a quite unrealistic picture of the real risk. Using super-simple tools for addressing complex accidental scenarios is probably not the best solution. For example, the hull penetration scenario itself has lots of complexity – penetrating a single compartment will not threaten global stability. Furthermore, the personnel will have time to analyze and act on the situation before it develops into an unrecoverable loss of stability – but the reliability of them doing so depends a lot on their training, competence and the installation’s leadership.

The take-away points from this short discussion are three:

  • Performance of ballast systems on ships is very good due to long history and robust designs
  • Setting performance requirements based on risk analysis requires a more in-depth view of the contributing factors (initiators and barriers)
  • Uncertainty in quantitative measures is very high, in part due to complexity and installation-specific factors; aiming for “generally accepted” technical standards is a good starting point.

Is the necessary SIL related to layers of protection or operating practices?

A safety integrity level is a quantification of the necessary risk reduction we need from an automated safety system to achieve acceptable risk levels for some industrial system. The necessary risk reduction obviously also depends on other activities and systems we put in place to reduce risk from its “intrinsic” level. The following drawing illustrates the role of the different things we can do to achieve acceptable risk for a technical asset.

Figure showing how risk reducing measures work together to bring the risk down to an acceptable level.

Consider for example a steel tank that is filled with pressurized gas. One potential hazard here is overpressure in the tank, which may cause a leak, and the gas can be both toxic and flammable – obviously a dangerous situation. When working with risk, we need to define what we mean by risk in terms of “acceptance criteria”. In this case, we may say that we accept an explosion due to a leak of gas and subsequent ignition once every one million years – that is a frequency of 10⁻⁶ per year. The initial frequency is maybe 0.1 per year, if the source of the high pressure is a controller intended to keep the pressure steady over time by adjusting a valve; such process control loops normally have one malfunction every 10 years (a coarse rule of thumb).

Passive technologies can here be a spring-loaded safety valve that opens on high pressure and lets the gas out to a safe location, for example a flare system where the gas can be burnt off in a controlled manner. This reduces the probability by 99% (such a passive valve tends to fail no more often than 1 out of 100 times). In addition to this, there is an independent alarm on the tank, giving a message to an operator in a control room that the pressure is increasing, and the operator has time to go and check what is going on, and shut off the supply of gas to the tank by closing a manual valve. How reliable is this operator? With sufficient time, and allowing for some confusion due to stress, we may claim that the operator manages to intervene 9 out of 10 times (such numbers can be found using human reliability analysis – a technique for assessing the performance of trained people in various situations, developed primarily within the nuclear industry). In addition, a terrible explosion does not automatically happen if there is a leak – something needs to ignite the gas. Depending on the ignition sources we can assign a probability to this (models exist). For this case, let us assume the probability of ignition of a gas cloud in this location is 10%.

We have now reduced the frequency by a factor of 10,000 from the initial “intrinsic” frequency of 0.1 per year. The frequency of such explosions due to a leak in the tank, before using any automatic shutdown system, is thus 0.1 x 0.01 x 0.1 x 0.1 = 0.00001 = 10⁻⁵ per year. The remaining reduction needed to bring the frequency down to 1 in a million years is then an automated shutdown function that does not fail more than 1 out of 10 demands – a probability of failure on demand (PFD) of 0.1 – which corresponds to a SIL 1 requirement. The process we used to deduce this number is by the way known as a LOPA – a layers of protection analysis. The LOPA is one of many tools in the engineer’s toolbox for performing risk assessments.

What this illustrates is that the requirement for an automated shutdown function depends on other risk mitigation efforts – and on the reliability of those barrier elements. What if the operator does not have time to intervene or cannot be trusted? If we take away the effect of the operator’s actions we see immediately that we need a SIL 2 function to achieve an acceptable level of safety.
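A minimal sketch of the same LOPA arithmetic in code, using the illustrative probabilities from the worked example above (they are assumptions for the example, not generic recommendations):

```python
target_frequency = 1e-6        # accepted explosion frequency, per year
initiating_frequency = 0.1     # control loop malfunction, per year

# Independent protection layers and conditional modifiers.
safety_valve_pfd = 0.01        # spring-loaded relief valve fails 1 in 100 demands
operator_failure = 0.1         # operator fails to intervene 1 in 10 times
ignition_probability = 0.1     # gas cloud ignites in 10% of leaks

# Frequency of an explosion without any automated shutdown function.
mitigated_frequency = (initiating_frequency * safety_valve_pfd
                       * operator_failure * ignition_probability)   # = 1e-5 per year

# Required probability of failure on demand for the shutdown function.
required_pfd = target_frequency / mitigated_frequency               # = 0.1 -> SIL 1
print(f"Frequency without SIF: {mitigated_frequency:.0e} per year")
print(f"Required PFD of the shutdown function: {required_pfd:.1f}")

# Without credit for the operator, the required PFD tightens to 0.01 -> SIL 2.
required_pfd_no_operator = target_frequency / (mitigated_frequency / operator_failure)
print(f"Required PFD without operator credit: {required_pfd_no_operator:.2f}")
```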