Extending the risk assessment mind map for information security

This post is based on the excellent mind map posted on taosecurity.blogspot.com, detailing the different fields of cybersecurity. The author (Richard) said he was not really comfortable with the risk assessment portion. I have tried to change the presentation of that portion into the more standard thinking about risk stemming from ISO 31000, rather than security tradition.

Red team and blue team activities are presented under penetration testing in the original mind map. I agree that the presentation there is a bit off – red team work is about pentesting, whereas the blue team is the defensive side. In normal risk management lingo these terms are not that common, which is why I left them out of the mind map for risk assessment. For an excellent discussion of these terms, see this post by Daniel Miessler: https://danielmiessler.com/study/red-blue-purple-teams/#gs.aVhyZis.

Suggested presentation of the risk assessment mind map – wrapping it in typical risk assessment activity descriptions

The map shown here breaks down the risk assessment process into the following containers:

  • Context description
  • Risk identification
  • Risk analysis
  • Treatment planning

There are of course many links between risk assessments and other security related activities. Risk monitoring and communication processes connect these dots.

Threat intelligence is also essential for understanding the context – which in turn dictates the attack scenarios and the credibility needed to prioritize risks. Threat intelligence entails many activities, as indicated by the original mind map. One source of intel from ops that is missing on that map, by the way, is threat hunting. That also ties into risk identification.

I have also singled out security ops as it is essential for risk monitoring. This is required on the tactical level to evaluate whether risk treatments are effective.

Further, “scorecards” is used here as a shorthand for strategic management – integration with strategic management and governance is necessary to ensure effective risk management and to involve the right parts of the organization.

IEC 61511 Security – getting the right detail level

When performing the risk and vulnerability assessment required by the new IEC 61511 standard, make sure the level of detail is right for your application. Normally the system integrator operates at the architectural level, meaning signal validation in software components should probably already have been dealt with. On the other hand, upgrading and maintaining the system during the entire lifecycle has to be looked into. Just enough detail can be hard to aim for: digging too deep is costly, and staying too shallow doesn’t help your decision making. Therefore, deciding on the depth of the security assessment should be a priority from the very beginning!

Starting with the context – having the end in mind

The purpose of including cybersecurity requirements in a safety instrumented system design is to make sure the reliability of the system is not threatened by security incidents. That reliability requires each safety instrumented function (SIF) to perform its intended task at the right moment; we are concerned with the availability and the integrity of the system.

 

The probability of failure on demand for a safety critical function usually depends on random error distributions and testing regimes. How can hacker threats be included in the thinking around reliability engineering? The goal is to remain confident in the reliability calculations, so that quantitative risk calculations are still meaningful.
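To keep the reliability-engineering starting point concrete, here is a minimal sketch of the common single-channel, low-demand approximation PFD_avg ≈ λ_DU · τ / 2. The failure rate and proof-test interval are illustrative assumptions, not data for any particular system.

```python
# Minimal sketch: average probability of failure on demand (PFD_avg) for a
# single-channel (1oo1) low-demand safety function, using the common
# approximation PFD_avg ~ lambda_DU * tau / 2. All numbers are illustrative.

def pfd_avg_1oo1(lambda_du_per_hour: float, proof_test_interval_hours: float) -> float:
    """Approximate average PFD for a 1oo1 low-demand safety function."""
    return lambda_du_per_hour * proof_test_interval_hours / 2.0

lambda_du = 2e-6   # assumed dangerous undetected failure rate, per hour
tau = 8760.0       # proof-test interval: one year, in hours
print(f"PFD_avg ~ {pfd_avg_1oo1(lambda_du, tau):.1e}")  # ~8.8e-03, i.e. in the SIL 2 band
```

Note that λ_DU only captures random hardware failures – a successful attack that disables the function sits outside this model, which is exactly why the security assessment is needed alongside the reliability calculation.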

 

In order to understand the threats to your system, you need to start with the company and its place in the world and in the supply chain. What does the company do? Consider an oil producer active in a global upstream market – producing offshore, onshore, as well as from unconventional sources such as tar sands, Arctic fields and shale oil. The company is also investing heavily in Iraq, including areas recently captured from ISIS. Furthermore, on the owner side of this company you find a Russian oligarch, known to be close to the Kremlin, as majority stockholder. The firm is listed on the Hong Kong stock exchange. Its key suppliers are Chinese engineering firms and steel producers, and its top customers are also Chinese government-backed companies. How does all of this affect the threat landscape as it applies to this firm?

The firm is involved in activities that may trigger the interest of hacktivists:

  • Unconventional oil production
  • Arctic oil production

It also operates in one of the most politically unstable regions in the world, where the world’s largest military powers have to some degree opposing interests – which could draw the attention of both terrorist groups and nation state hackers. It is also worth noting that the company is on good terms with both the Russian and Chinese governments, two countries often accused of using state sponsored hackers to target companies in the west. The largest nation state threat to this oil company may thus come from western countries, including the one headed by Donald Trump. He has been quite silent on cybersecurity after taking office, but issued statements during his 2016 campaign hinting at more aggressive build-ups of offensive capacities. So, the company should at least expect the interest of script kiddies, hacktivists, cybercriminals, terrorists, nation states and insiders. These groups have quite varying capabilities, and the SIS is typically hard to get at due to multiple firewalls and network segregation. Our main focus should thus be on hacktivists, terrorists and nation states – with cybercriminals and insiders acting as proxies (knowingly or not).

The end in mind: keeping safety-critical systems reliable also under attack – or at least making attacks an insignificant contributor to unreliability.

Granularity of security assessment

The goal of this discussion is to find the right depth level for risk and vulnerability assessments under IEC 61511. If we start with the threat actors and their capabilities, we observe some interesting issues:

  • Nation states: capable of injecting unknown features into firmware and application software at the production stage, including through human infiltration of engineering teams. These can also be “features” sanctioned by the producer in some countries. Actual operations can include cyber-physical incursions with real asset destruction.
  • Terrorists: infiltration of vendors is less likely. Typical capabilities are APTs using phishing to breach the attack surface, and availability attacks through DDoS, provided the SIS can be reached. Physical attack is also highly likely.
  • Cybercriminals: similar to terrorists, but may also have more advanced capabilities. Can also act out of own interest, e.g. through extortion schemes.
  • Hacktivists: unlikely to threaten firmware and software integrity. Not likely to desire asset damage, as that can easily lead to pollution, which conflicts with their likely motivations. DDoS attacks can be expected, but the SIS is usually not exposed.

Some of these actors have serious capabilities, and it is possible that they will be used if the political climate warrants it. As we are most likely relying on procured systems from established vendors, using limited variability languages for the SIS, we have little influence over the low-level software engineering. Configuration, choice of function blocks and any inclusion of custom-designed software blocks are another story. Our assessment should thus, at least, include the following aspects (a small sketch of how actor capabilities can be tied to these aspects follows the list):

  • Procurement – setting security requirements and general information security requirements, and managing the follow-up process and cross-organizational competence management.
  • Software components – criticality assessment. Extra testing requirements to vendors. Risk assessment including configuration items.
  • Architectural security – network segregation, attack surface exposure, monitoring, security technologies, responsible organizations and network operations.
  • Hardware – tampering risk, exposure to physical attacks, ports and access points, network access points including wireless (VSAT, microwave, GSM, WiFi).
  • Organizational security risks – project organization, operations organization. Review of roles and responsibilities, criticality of key personnel, workload aspects, contractual interfaces, third-party personnel.
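As an illustration of how this scoping can be kept explicit and reviewable, the sketch below ties assumed threat actor capabilities to the assessment aspects listed above. The actor names, capability labels and mappings are examples made up for this post – they are not taken from IEC 61511 or any other standard.

```python
# Illustrative sketch only: tying assumed threat actor capabilities to the
# assessment aspects they make relevant. Actors, capabilities and mappings
# are examples for this post, not taken from IEC 61511.

THREAT_ACTOR_CAPABILITIES = {
    "nation state":  {"firmware tampering", "supply chain infiltration", "phishing", "cyber-physical attack"},
    "terrorist":     {"phishing", "ddos", "physical attack"},
    "cybercriminal": {"phishing", "ddos", "extortion"},
    "hacktivist":    {"ddos"},
}

# Assumed mapping from capability to the assessment aspects it should trigger.
ASPECTS_BY_CAPABILITY = {
    "firmware tampering":        {"procurement", "software components"},
    "supply chain infiltration": {"procurement", "organizational security"},
    "phishing":                  {"organizational security", "architectural security"},
    "cyber-physical attack":     {"architectural security", "hardware"},
    "ddos":                      {"architectural security"},
    "physical attack":           {"hardware"},
    "extortion":                 {"organizational security"},
}

def assessment_scope(relevant_actors):
    """Union of the assessment aspects triggered by the selected threat actors."""
    scope = set()
    for actor in relevant_actors:
        for capability in THREAT_ACTOR_CAPABILITIES[actor]:
            scope |= ASPECTS_BY_CAPABILITY[capability]
    return sorted(scope)

print(assessment_scope(["hacktivist", "terrorist", "nation state"]))
```

The value of writing it down like this is not the code itself, but that the assumptions behind the chosen granularity become visible and easy to challenge when the context changes.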

Summary

This post does not give a general procedure for deciding on the depth of analysis, but it does outline important factors. Always start with the context to judge both the impact and the expected actions from threat actors. Use this to determine the capabilities of the main threat actors. This will help you decide the granularity level of your assessment. The things that are outside of your control should not be neglected either, but be treated as an uncertainty that may influence the security controls you need to put in place.

 

A sketch of key factors to include when deciding on the granularity of a cybersecurity risk assessment under IEC 61511

Users: from threats to security enhancers?

Security should be an organization-wide effort. That means getting everyone to play the same game, which requires IT to stop thinking about “users as internal threats” and start thinking about “internal customers as security enhancers”. This can only be achieved by using balanced security measures, involving the internal customers (users) by sharing the risk picture, and putting risk based thinking behind security planning to drive rational and balanced decisions. For many organizations with a pure compliance focus this can be a challenging journey – but the reward at the end is an organization better equipped to tackle a dynamic threat landscape.

Users have traditionally been seen as a major threat in information security settings. The insider threat is a very real thing but this does not mean that the user is the threat as such. There has recently been much discussion about how we can achieve a higher degree of cybersecurity maturity in organizations, and whether cybersecurity awareness training really works. This post does not give you the answers but describes some downsides to the compliance oriented tradition. The challenge is to find a good balance between controls and compliance on one side, and driving a positive security culture on the other.

Locking down users’ tools too much may enhance security on paper – until you consider the effect of trust erosion on the human attack surface. When knowledge workers feel they are not trusted by their organization, they also feel undervalued, something that can create hostility towards management in general, and towards cybersecurity policies specifically. It is no fun working in an environment where all the toys are locked down.

Most accidents involve “human error” somewhere in the accident chain, much like most security breaches involve some form of human error – typically a user failing to spot a social engineering attempt that the security technology has also failed to stop. Email is still the most common malware delivery method, and phishing would not work without humans on the other end. This picture is what your security department is used to seeing: the user performs some action that allows the attacker to penetrate the organization. Hence, the user is a threat. The cure for this is supposed to be cybersecurity awareness training, teaching users not to open attachments from sketchy sources, not to click those links, not to use weak passwords, and so on. The problem is that this only partially works. Some people have even gone so far as to say that it is completely useless.

The other part of the story is the user who reports that his or her computer is misbehaving, that some resources have become unavailable, or who forwards spear-phishing attempts. Those users are complying with policy and allowing the organization to spot potential recon or attack attempts before the fact, or at least relatively soon after a breach. These users are security enhancers – going beyond what security awareness training aims for, which is merely to make users a little bit less dangerous.

Because people do risky things when possible, the typical IT department answer to the insider threat is to lock down every workstation as much as possible – to “harden” it, i.e. make the attack surface smaller. This attack surface view, however, only considers the technology, not the social component. If you lock down systems more than the users feel is necessary, they will probably start opposing company policies. They will not report suspicious activities as often anymore. They will go through the motions of your awareness training, but little behavioral change will be seen afterwards. You risk that shadow IT takes hold in your business – that employees use their private cloud accounts, portable apps or private computers to do their jobs – because the tools they feel they need are locked down, made inflexible or simply made unavailable by the IT department in order to “reduce the attack surface”. So, not only do you risk priming your employees for social engineering attacks (angry employees are easier to manipulate) and making your staff less able to benefit from your training courses, but you may also be significantly increasing the technical attack surface through shadow IT.

So what is the solution – to allow users to do whatever they want on the network, give them admin rights and impose no controls? Obviously a bad idea. The keywords are balanced measures, involvement and risk based thinking.

  • Balanced: there must be a balance between security and productivity. A full lockdown may be required for information objects of high value to the firm and with credible attack scenarios, but not every piece of data and every operation is in that category.
  • Involvement: people need to understand why security measures are in place to make sense of the measures. Most security measures are impractical to people just wanting to get the job done. Understanding the implications of a breach and the cost-benefit ratio of the measures in place greatly helps people motivate themselves to do what feels slightly impractical.
  • Risk based thinking: measures must be adequate to the risk posed to the organization and not exaggerated. The risk picture must be shared with the employees as part of the security communication – this is a core leadership responsibility and the foundation of security aware cultures.

In the end it comes down to respect. Respect other people for what they do, and what value they bring to the organization. Think of them as customers instead of users. Only drug dealers and IT departments refer to their customers as users (quoted from somewhere forgotten on the internet).

Do SCADA vulnerabilities matter?

Sometimes we talk to people who are responsible for operating distributed control systems. These are sometimes linked up to remote access solutions for a variety of reasons. Still, the same people often do not understand that vulnerabilities are still being found in mature systems, and they often fail to take the typically simple actions needed to safeguard them.

For example, a new vulnerability was recently discovered for the Siemens Simatic CP 343-1 family. Siemens has published a description of the vulnerability, together with a firmware update to fix the problem: see Siemens.com for details.

So, are there any CP 343’s facing the internet? A quick trip to Shodan shows that, yes, indeed, there are lots of them. Everywhere, more or less.

Shodan search results for internet-facing Siemens CP 343 devices
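If you want to reproduce such a search yourself, a minimal sketch using the shodan Python library is shown below. The API key placeholder and the query string are assumptions you will need to adapt – TCP port 102 (ISO-TSAP) is the usual indicator of S7 communication, but the query should be refined to match the CP 343-1 specifically.

```python
# Minimal sketch using the shodan Python library (pip install shodan).
# The API key placeholder and the query string are assumptions - adapt both.
import shodan

api = shodan.Shodan("YOUR_SHODAN_API_KEY")  # placeholder key

# TCP port 102 (ISO-TSAP) is used for S7 communication on Siemens devices;
# refine the query further to narrow the results down to CP 343-1 modules.
query = "port:102 siemens"

try:
    result = api.count(query)  # count() returns totals only, no host details
    print(f"Internet-facing hits for '{query}': {result['total']}")
except shodan.APIError as error:
    print(f"Shodan API error: {error}")
```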

Now, if you did have a look at the Siemens site, you will see that the patch was available from the release date of the vulnerability, 27 November 2015. What, then, is the average time to deploy patches in a control system environment? There are no patch Tuesdays. In practice, such systems are patched somewhere between monthly and never, with a bias towards never. That means the bad guys have plenty of opportunities to exploit your systems before a patch is deployed.

This simple example reinforces that we should stick to the basics:

  • Know the threat landscape and your barriers
  • Use architectures that protect your vulnerable systems
  • Do not use remote access where it is not needed
  • Reward good security behaviors among employees and sanction bad attitudes
  • Create a risk mitigation plan based on the threat landscape – and stick to it in practice too

 

SIL and ballast systems

When working on floating oil and gas facilities, one question keeps popping up about ballast systems: should they have SIL requirements, and if so, what should the requirements be? When seeking to establish requirements for such systems, several issues are uncovered. First of all, current ballast system designs are very robust, due to the evolution of designs and requirements in shipping over a long time. Further, the problem is much more complex than collecting a few well-defined failure modes with random error data leading to a given situation, as is typical of many process industry problem descriptions. This complexity depends on a number of factors, some of them specific to each ship or installation, such as location, ship traffic density or the operating practices of personnel on board. Therefore, any quantitative estimates of “error probabilities” contributing to an expected return frequency of critical events will have significant uncertainties associated with them.


A ballast system is used to maintain the stability of a ship or a floating hull structure under varying cargo loading conditions and in various sea conditions and ship drafts. Water is kept in tanks dispersed around the hull structure, and can be pumped in or out, or transferred between tanks, to maintain stability. Errors in ballasting operations can lead to loss of stability, which in the worst case means a sunken ship. Ballasting is normally a semi-manual operation where a marine operator uses a loading computer to guide decisions about ballasting, and manually gives commands to a computer based control system to transfer water into or out of a particular ballast tank. Because this is such a critical safety system, it is natural to ask: “what are the performance requirements?”

Ballast systems have been part of shipping for hundreds of years. Requirements for ballast systems are thus set in the classification rules of ship classification societies, such as Lloyd’s Register, DNV GL or ABS. These requirements are typically prescriptive in nature and focus on robustness and avoidance of common cause failures in the technology. Maritime classification societies do not refer to safety integrity levels, but rely on other means of ensuring safe operation and reliability. Society has accepted this practice for years, for very diverse vessels ranging from oil tankers to passenger cruise ships.

In oil and gas operations, the use of safety integrity levels to establish performance requirements for instrumented safety functions is the norm, and standards such as IEC 61508 are used as the point of reference. The Norwegian Oil and Gas Association has made a guideline that is normally applied for installations in Norwegian waters, which offers a simplification of requirements setting based on “typical performance”. The guideline can be freely downloaded from this web page. It states that “start of ballasting for rig re-establishment” should conform to a SIL 1 requirement. The “system” is described as consisting of a ballast control node, 2 x 100% pumps and three ballast valves. In appendix A.12 of the guideline, a description of this “sub-function” is given, along with a calculation of achievable performance.

It may be argued that this functional description is somewhat artificial, because the ballast system on a production installation is normally operated more or less continuously. The function is defined for a single ballast tank/compartment, irrespective of the number of tanks and the load balancing necessary to re-establish stability. The Guideline 070 approach is based on “typical performance” of the safety system as it is defined, and is not linked directly to the risk reduction required from the system.

Multiple approaches may be taken to assign safety integrity levels based on risk analysis, see for example IEC 61508. One such method that is particularly common in the process industries and the oil and gas industry is “layers of protection analysis”, or LOPA for short. In this type of study, multiple initiating events can contribute to one hazard, for example “sunken ship due to loss of stability”. Multiple barriers or “independent protection layers” can be credited with reducing the risk of this hazard being realized. In order to use a risk based method for setting the integrity requirement, it is necessary to define what an acceptable frequency of this event is. Let us say, for the sake of the discussion, that it is acceptable that the mean time between each “sunken ship due to loss of stability” event is 1 million years. How can we reason about this to establish requirements for the ballast system? The functional requirement is that we should “be able to shift ballast loading to re-establish stability before the condition becomes unrecoverable”. To start analyzing this situation, we need to estimate how often we will have a condition that can lead to such an unrecoverable situation if not correctly managed. Let us consider three such “initiating events”:

  • Loading operator error during routine ballasting (human error)
  • Damage to hull due to external impact
  • Error in load computer calculations

All of these situations depend on a number of factors. The probability that the loading operator will perform an erroneous action depends on stress levels, competence/training and management factors. A thorough analysis using “human reliability analysis” can be performed, or a more simplified approach may be taken. We may, for example, assume that the average operator makes one error without noticing it immediately every 100 years (this is an assumption – it must be validated if used).

Damage to the hull from external impact would depend on the ship traffic density in the area, whether there is a difficult political situation (war, etc.), and whether you are operating in Arctic environments where ice impact is likely (think Titanic). Again, you may do extensive analysis to establish such data, or make some assumptions based on expert judgment. For example, we may assume a penetrating ship collision every 100 years on average.

What about errors in load computer calculations? Do the operators trust the load computer blindly, or do they perform sanity checks? How was the load computer programmed? Is the software mature? Is the loading condition unusual? Many questions may be asked here as well. For the sake of this example, let us assume there is no contribution from the loading computer.

We are then looking at an average initiating event frequency of 0.01 per year for human errors and 0.01 per year for hull damage.

Then we should think about what our options for avoiding the accidental scenario are, given that one of the initiating events has already occurred. As “rig re-establishment” depends on the operator performing some action on the ballast system, the key to such barriers is making the operator aware of the situation. One natural way to do this would be to install an alarm indicating a dangerous ballast condition, and train the operator to respond. What is the reliability of this as a protection layer? The ballast function itself is what we are trying to set the integrity requirement for, and any response of the operator requires this system to work. Simply notifying the operator is thus necessary, but not sufficient. If the ballast system fails when the operator tries to rectify the situation, the big question is: does the operator have a second option? Such an option may be a redundant ballast system that does not use the same components, to avoid common cause failure. In most situations the dynamics will be slow enough to permit manual operation of pumps and valves from local control panels. This is a redundant option if the operator is trained for it. If the alarm does not use the same components as the function itself, we have an independent protection layer. The reliability of this, combined with the required response of a well-trained operator, cannot be credited with better than a 90% success rate in a critical situation (ref. IEC 61511, for example).

So, based on this super-simplified analysis, are we achieving our required MTTF of 1 million years?

Events per year: 0.02.

Failure in IPL: Alarm + operator response using local control panels: 0.1.

OK, so we are achieving an MTTF of:

1/(0.02 x 0.1) = 500 years.

This is pretty far from where we said we should be. First of all, closing the gap would require our ballast function to deliver a probability of failure on demand of about 5 x 10^-4 – roughly SIL 3 performance, which is completely unrealistic for a function that depends on operator action – and furthermore, the 500-year figure already credits the same operator performing manual actions. Of course, considering how many ships are floating at sea and how few of them are sinking, this is probably a quite unrealistic picture of the real risk. Using super-simple tools for addressing complex accidental scenarios is probably not the best solution. For example, the hull penetration scenario itself has lots of complexity – penetrating a single compartment will not threaten global stability. Furthermore, the personnel will have time to analyze and act on the situation before it develops into an unrecoverable loss of stability – but the reliability of them doing so depends a lot on their training, competence and the installation’s leadership.
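For reference, the super-simplified arithmetic above can be written out as a short script. All numbers are the assumptions made earlier in this section, not validated failure data.

```python
# Sketch of the super-simplified ballast LOPA above. All numbers are the
# illustrative assumptions from the text, not validated failure data.

initiating_events_per_year = {
    "operator error during routine ballasting": 0.01,  # assumed 1 per 100 years
    "hull damage from external impact": 0.01,          # assumed 1 per 100 years
    "load computer error": 0.0,                        # assumed negligible
}

# Independent protection layer: dangerous-condition alarm plus a trained
# operator using local control panels, credited with 90% success at best.
ipl_failure_probability = 0.1

total_initiating_frequency = sum(initiating_events_per_year.values())   # 0.02 per year
# Frequency of losing stability if the ballast function we are rating fails:
event_frequency = total_initiating_frequency * ipl_failure_probability  # 0.002 per year

print(f"Event frequency without crediting the ballast function: {event_frequency:.3g} per year")
print(f"Mean time between events: {1 / event_frequency:.0f} years (target: 1,000,000 years)")
```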

The take-away points from this short discussion are three:

  • Performance of ballast systems on ships is very good due to long history and robust designs
  • Setting performance requirements based on risk analysis requires a more in-depth view of the contributing factors (initiators and barriers)
  • Uncertainty in quantitative measures is very high, in part due to complexity and installation-specific factors – aiming for “generally accepted” technical standards is a good starting point.

Is the necessary SIL related to layers of protection or operating practices?

A safety integrity level is a quantification of the risk reduction we need from an automated safety system to achieve an acceptable risk level for some industrial system. The necessary risk reduction obviously also depends on the other activities and systems we put in place to reduce risk from its “intrinsic” level. The following drawing illustrates how the different things we can do work together to achieve acceptable risk for a technical asset.

Figure showing how risk reducing measures work together to bring the risk down to an acceptable level.

Consider for example a steel tank that is filled with pressurized gas. One potential hazard here is overpressure in the tank, which may cause a leak – and the gas can be both toxic and flammable, obviously a dangerous situation. When working with risk, we need to define what we mean by risk in terms of “acceptance criteria”. In this case, we may say that we accept an explosion due to a leak of gas with subsequent ignition once every one million years – that is, a frequency of 10^-6 per year. The initial frequency is maybe 0.1 per year, if the source of the high pressure is a controller intended to keep the pressure steady over time by adjusting a valve; normally, such process control loops have one malfunction every 10 years (a coarse rule of thumb).

Passive technology can here be a spring-loaded safety valve that opens on high pressure and lets the gas out to a safe location, for example a flare system where the gas can be burnt off in a controlled manner. This reduces the probability by 99% (such a passive valve tends to fail no more often than 1 out of 100 times). In addition to this, there is an independent alarm on the tank, giving a message to an operator in a control room that the pressure is increasing, and the operator has time to go and check what is going on and shut off the supply of gas to the tank by closing a manual valve. How reliable is this operator? With sufficient time, and allowing for some confusion due to stress, we may claim that the operator manages to intervene 9 out of 10 times (such numbers can be found using human reliability analysis – a technique for assessing the performance of trained people in various situations, developed primarily within the nuclear industry). In addition, a terrible explosion does not automatically happen if there is a leak – something needs to ignite the gas. Depending on the ignition sources, we can assign a probability to this (models exist); for this case, let us assume the probability of ignition of a gas cloud in this location is 10%.

We have now reduced the frequency by a factor of 10,000 from the initial “intrinsic” frequency of 0.1 per year. The frequency of explosions due to a leak in the tank, before using any automatic shutdown system, is thus 0.1 x 0.01 x 0.1 x 0.1 = 0.00001 = 10^-5 per year. The remaining reduction needed to bring the frequency down to 1 in a million years is then an automated shutdown function that does not fail on more than 1 out of 10 demands – a PFD of 0.1. This means we need a safety instrumented function with a probability of failure on demand of 0.1, which corresponds to a SIL 1 requirement. The process we used to deduce this number is, by the way, known as a LOPA – a layers of protection analysis. The LOPA is one of many tools in the engineer’s toolbox for performing risk assessments.

What this illustrates is that the requirement for an automated shutdown function depends on other risk mitigation efforts – and on the reliability of those barrier elements. What if the operator does not have time to intervene, or cannot be trusted? If we take away the effect of the operator’s actions, we see immediately that we need a SIL 2 function (PFD of 0.01) to achieve an acceptable level of safety.
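To close the loop, here is a compact sketch of the tank LOPA using the figures assumed above, including the variant where no credit is taken for the operator. The layer names and probabilities are the illustrative assumptions from the example, nothing more.

```python
# Sketch of the pressurized-tank LOPA from the text. All probabilities are the
# illustrative assumptions above, not generic failure data.
import math

TARGET_FREQUENCY = 1e-6      # acceptable explosion frequency, per year
INITIATING_FREQUENCY = 0.1   # pressure control loop malfunction, per year

def required_pfd(layer_probabilities):
    """Required PFD of the shutdown function, given the other protection layers."""
    mitigated_frequency = INITIATING_FREQUENCY * math.prod(layer_probabilities)
    return TARGET_FREQUENCY / mitigated_frequency

def required_sil(pfd):
    """Lowest low-demand SIL band that guarantees the required risk reduction."""
    rrf = 1.0 / pfd  # required risk reduction factor
    return min(4, max(1, math.ceil(math.log10(rrf) - 1e-9)))  # tolerance for float noise

layers = {
    "spring-loaded relief valve": 0.01,  # fails ~1 in 100 demands
    "operator intervention": 0.1,        # succeeds ~9 out of 10 times
    "probability of ignition": 0.1,
}

pfd = required_pfd(layers.values())
print(f"All layers credited: required PFD {pfd:.2g} -> SIL {required_sil(pfd)}")   # 0.1  -> SIL 1

no_operator = [p for name, p in layers.items() if name != "operator intervention"]
pfd = required_pfd(no_operator)
print(f"No operator credit:  required PFD {pfd:.2g} -> SIL {required_sil(pfd)}")   # 0.01 -> SIL 2
```

Running this reproduces the conclusion above: a required PFD of 0.1 (SIL 1) with all layers credited, and 0.01 (SIL 2) when the operator cannot be relied upon.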