Impact of OT attacks: death, environmental disasters and collapsing supply-chains

Securing operational technologies (OT) is different from securing enterprise IT systems. Not because the technologies themselves are so different – but the consequences are. OT systems are used to control all sorts of systems we rely on for modern society to function; oil tankers, high-speed trains, nuclear power plants. What is the worst thing that could happen, if hackers take control of the information and communication technology based systems used to operate and safeguard such systems? Obviously, the consequences could be significantly worse than a data leak showing who the customers of an adult dating site are. Death is generally worse than embarrassment.

No electricity could be a consequence of an OT attack.

When people think about cybersecurity, they typically think about confidentiality. IT security professionals will take a more complete view of data security by considering not only confidentiality, but also integrity and availability. For most enterprise IT systems, the consequences of hacking are financial, and sometimes also legal. Think about data breaches involving personal data – we regularly see stories about companies and also government agencies being fined for lack of privacy protections. This kind of thinking is often brought into industrial domains; people doing risk assessments describe consequences in terms such as “unauthorized access to data” or “data could be changed be an unauthorized individual”.

The real consequences we worry about are physical. Can a targeted attack cause a major accident at an industrial plant, leaking poisonous chemicals into the surroundings or starting a huge fire? Can damage to manufacturing equipment disrupt important supply-chains, thereby causing shortages of critical goods such as fuels or food? That is the kind of consequences we should worry about, and these are the scenarios we need to use when prioritizing risks.

Let’s look at three steps we can take to make cyber risks in the physical world more tangible.

Step 1 – connect the dots in your inventory

Two important tools for cyber defenders of all types are “network topologies” and “asset inventory”. If you do not have that type of visibility in place, you can’t defend your systems. You need to know what you have to defend it. A network topology is typically a drawing showing you what your network consists of, like network segments, servers, laptops, switches, and also OT equipment like PLC’s (programmable logic curcuits), pressure transmitters and HMI’s (human-machine interfaces – typically the software used to interact with the sensors and controllers in an industrial plant). Here’s a simple example:

An example of a simplified network topology

A drawing like this would be instantly recognizable to anyone working with IT or OT systems. In addition to this, you would typically want to have an inventory describing all your hardware systems, as well as all the software running on your hardware. In an environment where things change often, this should be generated dynamically. Often, in OT systems, these will exist as static files such as Excel files, manually compiled by engineers during system design. It is highly likely to be out of date after some time due to lack of updates when changes happen.

Performing a risk assessment based on these two common descriptions is a common exercise. The problem is, that it is very hard to connect this information to the physical consequences we want to safeguard against. We need to know what the “equipment under control” is, and what it is used for. For example, the above network may be used to operate a batch chemical reactor running an exothermic reaction. That is, a reaction that produces heat. Such reactions need cooling, if not the system could overheat, and potentially explode as well if it produces gaseous products. We can’t see that information from the IT-type documentation alone; we need to connect this information to the physical world.

Let’s say the system above is controlling a reactor that has a heat-producing reaction. This reactor needs cooling, which is provided by supplying cooling water to a jacket outside the actual reactor vessel. A controller opens and closes a valve based on a temperature measurement in order to maintain a safe temperature. This controller is the “Temperature Control PLC” in the drawing above. Knowing this, makes the physical risk visible.

Without knowing what our OT system controls, we would be led to think about the CIA triad, not really considering that the real consequences could be a severe explosion that could kill nearby workers, destroy assets, release dangerous chemical to the environment, and even cause damage to neighboring properties. Unfortunately, lack of inventory control, especially connecting industrial IT systems to the physical assets they control, is a very common problem (disclaimer – this is an article from DNV, where I am employed) across many industries.

An example of a physical system: a continuously stirred-tank reactor (CSTR) for producing a chemical in a batch-type process.

Step 1 – connect the dots: For every server, switch, transmitter, PLC and so on in your network, you need to know what jobs these items are a part of performing. Only that way, you can understand the potential consequences of a cyberattack against the OT system.

Step 2 – make friends with physical domain experts

If you work in OT security, you need to master a lot of complexity. You are perhaps an expert in industrial protocols, ladder logic programming, or building adversarial threat models? Understanding the security domain is itself a challenge, and expecting security experts to also be experts in all the physical domains they touch, is unrealistic. You can’t expect OT security experts to know the details of all the technologies described as “equipment under control” in ISO standards. Should your SOC analyst be a trained chemical engineer as well as an OT security expert? Or should she know the details of steel strength decreases with a temperature increase due to being engulfed in a jet fire? Of course not – nobody can be an expert at everything.

This is why risk assessments have to be collaborative; you need to make sure you get the input from relevant disciplines when considering risk scenarios. Going back to the chemical reactor discussed above, a social engineering incident scenario could be as follows.

John, who works as a plant engineer, receives a phishing e-mail that he falls for. Believing the attachment to be a legitimate instruction of re-calibration of the temperature sensor used in the reactor cooling control system, he executes the recipe from the the attachment in the e-mail. This tells him to download a Python file from Dropbox folder, and execute it on the SCADA server. By doing so, he calibrates the temperature sensor to report 10 degrees lower temperature than what it really measures. It also installs a backdoor on the SCADA server, allowing hackers to take full control of it over the Internet.

The consequences of this could potentially overpressurizing the reactor, causing a deadly explosion. The lack of cooling on the reactor would make a chemical engineering react, and help understand the potential physical consequences. Make friends with domain experts.

Another important aspect of domain expertise, is knowing the safety barriers. The example above was lacking several safety features that would be mandatory in most locations, such as having a passive pressure-relief system that works without the involvement of any digital technologies. In many locations it is also mandatory to have a process shutdown systems, a control system with separate sensors, PLC’s and networks to intervene and stop the potential accident from happening by using actuators also put in place only for safety use, in order to avoid common cause failures between normal production systems and safety critical systems. Lack of awareness of such systems can sometimes make OT security experts exaggerate the probability of the most severe consequences.

Step 2 – Make friends with domain experts. By involving the right domain expertise, you can get a realistic picture of the physical consequences of a scenario.

Step 3 – Respond in context

If you find yourself having to defend industrial systems against attacks, you need an incident response plan. This is no different from an enterprise IT environment; you also need an incident response plan that takes the operational context into account here. A key difference, though, is that for physical plants your response plan may actually involve taking physical action, such as manually opening and closing valves. Obviously, this needs to be planned – and exercised.

If welding will be a necessary part of handling your incident, coordinating with the industrial operations side better be part of your incident response playbooks.

Even attacks that do not affect OT systems directly, may lead to operational changes in the industrial environment. Hydro, for example, was hit with a ransomware attack in 2019, crippling its enterprise IT systems. This forced the company to turn to manual operations of its aluminum production plants. This bears lessons for us all, we need to think about how to minimize impact not just after an attack, but also during the response phase, which may be quite extensive.

Scenario-based playbooks can be of great help in planning as well as execution of response. When creating the playbook we should

  • describe the scenario in sufficient detail to estimate affected systems
  • ask what it will take to return to operations if affected systems will have to be taken out of service

The latter question would be very difficult to answer for an OT security expert. Again, you need your domain expertise. In terms of the cyber incident response plan, this would lead to information on who to contact during response, who has the authority to make decision about when to move to next steps, and so on. For example, if you need to switch to manual operations in order to continue with recovery of control system ICT equipment in a safe way, this has to be part of your playbook.

Step 3 – Plan and exercise incident response playbooks together with industrial operations. If valves need to be turned, or new pipe bypasses welded on as part of your response activities, this should be part of your playbook.

OT security is about saving lives, the environment and avoiding asset damage

In the discussion above it was not much mention of the CIA triad (confidentiality, integrity and availability), although seen from the OT system point of view, that is still the level we operate at. We still need to ensure only authorized personnel has access to our systems, we need to ensure we protect data during transit and in storage, and we need to know that a packet storm isn’t going to take our industrial network down. The point we want to make is that we need to better articulate the consequences of security breaches in the OT system.

Step 1 – know what you have. It is often not enough to know what IT components you have in your system. You need to know what they are controlling too. This is important for understanding the risk related to a compromise of the asset, but also for planning how to respond to an attack.

Step 2 – make friends with domain experts. They can help you understand if a compromised asset could lead to a catastrophic scenario, and what it would take for an attacker to make that happen. Domain experts can also help you understand independent safety barriers that are part of the design, so you don’t exaggerate the probability of the worst-case scenarios.

Step 3 – plan your response with the industrial context in mind. Use the insight of domain experts (that you know are friends with) to make practical playbooks – that may include physical actions that need to be taken on the factory floor by welders or process operators.

Does cyber insurance make sense?

Insurance relies on pooled risk; when a business is exposed to a risk it feels is not manageable with internal controls, the risk can be deferred to the capital markets through an insurance contract. For events that are unlikely to hit a very large number of insurance customers at once, this model makes sense. The pooled risk allows the insurer to create capital gains on the premiums paid by the customers, and the customers get their financial losses covered in case of a claim situation. This works very well for many cases, but insurers will naturally try to limit their liabilities, through “omissions clauses”; things that are not covered by the insurance policy. The omissions will typically include catastrophic systemic events that the insurance pool would not have the means to cover because a large number of customers would be hit simultaneously. It will also include conditions with the individual customer causing the insurance coverage to be voided – often referred to as pre-existing conditions. A typical example of the former is damages due to acts of war, or natural disasters. For these events, the insured would have to buy extra coverage, if at all offered. An example of the latter omission type, the pre-existing condition, would be diseases the insured would have before entering into a health insurance contract.

20150424_155646090_iOS
Risk pooling works well for protecting the solvency of insurers when issuing policies covering rare independent events with high individual impact – but is harder to design for covering events where there is systemic risk involved. Should insurers cover the effects of large-scale virus infections like WannaCry over normal cyber-insurance policies? Can they? What would the financial aggregate cost of a coordinated cyber-attack be on a society when major functions collapse – such as the Petya case in the Ukraine? Can insurers cover the cost of such events?

How does this translate into cyber insurance? There are several interesting aspects to think about, in both omissions categories. Let us start with the systemic risk – what happens to the insurance pool if all customers issue claims simultaneously? Each claim typically exceed the premiums paid by any one single customer. Therefore, a cyberattack that spreads to large portions of the internet are hard to insure while keeping the insurer’s risk under control. Take for example the WannaCry ransomware attack in May; within a day more than 200.000 computers in 150 countries were infected. The Petya attack following in June caused similar reactions but the number of infected computers is estimated to be much lower. As the WannaCry still looks like a poorly implemented cybercrime campaign intended to make money for the attacker, the Petya ransomware seems to have been a targeted cyberweapon used against  the Ukraine; the rest was collateral damage, most likely. But for Ukrainian companies, the government and computer users this was a major attack: it took down systems belonging to critical infrastructure providers, it halted airport operations, it affected the government, it took hold of payment terminals in stores; the attack was a major threat to the entire society. What could a local insurer have done if it had covered most of those entities against any and all cyberattacks? It would most likely not have been able to pay out, and would have gone bankrupt.

A case that came up in security forums after the WannaCry attack was “pre-existing condition” in cyber insurance. Many policies had included “human error” in the omissions clauses; basically, saying that you are not covered if you are breached through a phishing e-mail.  Some policies also include an “unpatched system” as an omission clause; if you have not patched, you are not covered. Not all policies are like that, and underwriters will typically gauge a firm’s cyber security maturity before entering into an insurance contract covering conditions that are heavily influenced by security culture. These are risks that are hard to include in quantitative risk calculations; the data are simply not there.

Insurance is definitely a reasonable control for mature companies, but there is little value in paying premiums if the business doesn’t satisfy the omissions clauses. For small businesses it will pay off to focus on the fundamentals first, and then to expand with a reasonable insurance policy.

For insurance companies it is important to find reasonable risk pooling measures to better cover large-scale infections like WannaCry. Because this is a serious threat to many businesses, not having meaningful insurance products in place will hamper economic growth overall. It is also important for insurers to get a grasp on individual omissions clauses – because in cyber risk management the thinking around “pre-existing condition” is flawed – security practice and culture is a dynamic and evolving thing, which means that the coverage omissions should be based on current states rather than a survey conducted prior to policy renewal.