Zero-Day OT Nightmare? How Zero-Trust Can Stop APT Attacks

It was a crisp summer Monday, and Alex, the maintenance engineer at Pulp Friction Paper Company, arrived with his coffee, ready to tackle the day. He reveled in the production regularity achieved thanks to his recently implemented smart maintenance program. This program used machine learning to anticipate condition degradation in machinery, a significant improvement over the facility’s previous reliance on traditional periodic maintenance or the ineffective risk-based approaches.

Alex, a veteran at Pulp Friction, had witnessed the past struggles. Previously, paper products were frequently rejected due to inconsistencies in humidity control, uneven drying, or even mechanical ruptures. He was a firm believer in leveraging modern technology, specifically AI, to optimize factory operations. While not a cybersecurity expert, his awareness wasn’t limited to just using technology. He’d read about the concerning OT (Operational Technology) attacks in Denmark last year, highlighting the inherent risks of interconnected systems.

As a seasoned maintenance professional, Alex understood the importance of anticipating breakdowns for effective mitigation. He empathized with the security team’s constant vigilance against zero-day attacks – those unpredictable, catastrophic failures that could turn a smooth operation into a major incident overnight.

Doctors analysing microliter blood samples on advanced chromatographic paper from Pulp Friction Paper Mills

Dark clouds over Pulp Friction

A digital phantom stalked the web. “The Harvester,” a notorious APT (Advanced Persistent Threat) group known for targeting high-value assets, had Pulp Friction Paper Company in their crosshairs. Their prize? Not paper, but a revolutionary innovation: medical diagnostic paper. Pulp Friction had recently begun producing these specialized sheets, embedded with advanced materials, for use in chromatographic tests. This cutting-edge technology promised rapid diagnosis of a multitude of diseases from mere microliter blood samples, a potential game-changer in the medical field. Unbeknownst to Alex, a gaping zero-day vulnerability resided within the facility’s industrial control system (ICS) software. If exploited, The Harvester could wreak havoc, disrupting production of these life-saving diagnostic tools and potentially delaying critical medical care for countless individuals. The stakes had just been raised. Could Alex, with his limited cybersecurity awareness, and the current defenses, thwart this invisible threat and ensure the smooth flow of this vital medical technology?

A wave of unease washed over Alex as he stared at the malfunctioning control panel. The usually predictable hum of the paper production line had been replaced by a cacophony of alarms and erratic readings. Panic gnawed at him as vital indicators for the chromatographic test paper production process lurched erratically. This wasn’t a typical equipment malfunction – it felt deliberate, almost malicious.

Just then, a memory flickered in Alex’s mind. Sarah, the friendly and highly skilled network security specialist he occasionally consulted with, had been pushing for a new security system called “zero-trust.” While Alex appreciated Sarah’s expertise, he hadn’t quite understood the nuances of the system or its potential benefits. He’d brushed it off as an extra layer of complexity for an already demanding job.

Now, regret gnawed at him alongside the growing sense of dread. Grabbing his phone, Alex dialed Sarah’s number, his voice laced with a tremor as he blurted out, “Sarah, something’s terribly wrong with the ICS system! The readings are all messed up, and I don’t know what’s happening!” The urgency in his voice was impossible to miss, and Sarah, sensing the dire situation, promised to be there as soon as possible. With a heavy heart, Alex hung up, the echo of his own ignorance a stark reminder of the consequences he might have inadvertently unleashed by ignoring the recommendations on network security improvements.

The Harvester: a technical intermezzo

Diagram made with mermaid.live

The Harvester is capable of zero-day research and exploit development. In this attack they are targeting companies that use advanced technologies to supply healthcare providers – and many of those companies use innovative maintenance systems.

They first find a web-exposed server used by the AI-driven maintenance system. The system is Internet-exposed because multiple vendors frequently need access to it. By exploiting the vulnerability there, they gain root access to the underlying Linux operating system. The Harvester, like many other threat actors, then installs a web shell for convenient persistent access, and continues onwards using conventional techniques. Reaching the engineering workstation, the attacker is able to reprogram PLCs and disable safety features. Having achieved this, the system is no longer a highly reliable production line for diagnostic test paper: it is a bleeping mess, spilling water, breaking paper lines and leaving damage that is difficult to fix.

They continue by posing as Pulp Friction employees, leaking CCTV footage of the chaos on the factory floor with panicking employees running around, and posting on social media claiming that Pulp Friction never cared about reliability or security, that money was the only goal, and that a company with no regard for patient safety should never be allowed to supply anything to hospitals or care providers!

What it took to get back to business

Sarah arrived at Pulp Friction, a whirlwind of focused energy. Immediately, she connected with Alex and reviewed the abnormal system behavior. Her sharp eyes landed on the internet access logs for the smart maintenance system – a system Alex had mentioned implementing. Bingo! This web-exposed system, likely the initial point of entry, was wide open to the internet. Without hesitation, Sarah instructed the IT team to isolate and disable the internet access for the maintenance system – a crucial first step in stemming the bleeding.

“The only thing necessary for the triumph of evil is for good men to do nothing.”

Edmund Burke

Cybersecurity meaning: “Don’t be a sitting duck the day the zero-day is discovered.”

Next, she initiated the full incident response protocol, securing compromised systems, isolating affected network segments, and reaching out to both the Pulp Friction IT team and external forensics experts. The following 48 hours were a blur – a symphony of collaboration. Sarah led the incident response, directing forensics on evidence collection and containment, while the IT team worked feverishly to restore services and patch vulnerabilities.

Exhausted but resolute, Sarah and Alex presented their findings to the CEO. The CEO, witnessing the team’s dedication and the potential consequences, readily approved Sarah’s plan for comprehensive security improvements, including implementing zero-trust and segmentation on the OT network, finally putting Pulp Friction on the path to a more robust defense. They couldn’t erase the attack, but they could ensure it wouldn’t happen again.

With the immediate crisis averted, Sarah knew a stronger defense was needed. She turned to Alex, his eyes reflecting a newfound appreciation for cybersecurity. “Remember zero-trust, Alex? The system I’ve been recommending?” Alex nodded, his earlier skepticism replaced by a desire to understand.

“Think of it like guarding a high-security building,” Sarah began. “No one gets in automatically, not even the janitor. Everyone, from the CEO to the maintenance crew, has to show proper ID and get verified every time they enter.”

Alex’s eyes lit up. “So, even if someone snuck in through a hidden door (like the zero-day), they wouldn’t have access to everything?”

“Exactly!” Sarah confirmed. “Zero-trust constantly checks everyone’s access, isolating any compromised systems. Imagine the attacker getting stuck in the janitor’s closet, unable to reach the control room.”

Alex leaned back, a relieved smile spreading across his face. “So, with zero-trust, even if they got in through that maintenance system, they wouldn’t be able to mess with the paper production?”

“Precisely,” Sarah said. “Zero-trust would limit their access to the compromised system itself, preventing them from reaching critical control systems or causing widespread disruption.”

With the analogy clicking, Alex was on board. Together, Sarah and Alex presented the zero-trust solution to the CEO, emphasizing not only the recent attack but also the potential future savings and improved operational efficiency. Impressed by their teamwork and Sarah’s clear explanation, the CEO readily approved the implementation of zero-trust and segmentation within the OT network.

Pulp Friction, once vulnerable, was now on the path to a fortress-like defense. The zero-day vulnerability might have been a wake-up call, but with Sarah’s expertise and Alex’s newfound understanding, they had turned a potential disaster into a catalyst for a much stronger security posture. As production hummed back to life, creating the life-saving diagnostic paper, a sense of quiet satisfaction settled in. They couldn’t erase the attack, but they had ensured it wouldn’t happen again.

How Alex and Sarah collaborated to achieve zero-trust benefits in the OT network

Zero-trust in the IT world relies a lot on identity and endpoint security posture. Both those concepts can be hard to implement in an OT system. This does not mean that zero-trust concepts have no place in industrial control systems, it just means that we have to play within the constraints of the system.

  • Network segregation is critical. Upgrading from old firewalls to modern firewalls with strong security features is a big win.
  • Use smaller security zones than what has been traditionally accepted (see the sketch after this list)
  • For Windows systems in the factory, at Level 3 and in the DMZ (Level 3.5) of the Purdue model, we are primarily dealing with IT systems. Apply strong identity controls, and make the systems patchable. The excuse is often that systems cannot be patched because no downtime is allowed, but virtualization and modern resilient architectures allow us to do workload management and zero-downtime patching. But we need to plan for it!
  • For any systems with weak security features, compensate with improved observability
  • Finally, don’t expose things to the Internet. Secure your edge devices, use DMZs, VPNs, and privileged access management (PAM) systems with temporary credentials.
  • Don’t run things as root/administrator. You almost never need to.
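
As a minimal sketch of the segmentation and small-zone points, here is what a default-deny policy between two zones could look like on a Linux-based segmentation gateway. The interface names, addresses and the single allowed flow are made-up examples; most plants will use a dedicated OT firewall appliance, but the principle is the same: deny everything between zones by default, and only open the specific flows you have documented.

# Hypothetical zones: eth1 = information/maintenance zone (10.10.1.0/24),
# eth2 = control zone (10.10.2.0/24)

# Deny all traffic routed between the zones by default
iptables -P FORWARD DROP

# Allow return traffic for connections that have been explicitly allowed
iptables -A FORWARD -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

# Allow one documented flow: the historian in the control zone pushes
# process data to the maintenance server in the information zone over TLS
iptables -A FORWARD -i eth2 -o eth1 -s 10.10.2.20 -d 10.10.1.10 -p tcp --dport 443 -m conntrack --ctstate NEW -j ACCEPT

# Log anything else that tries to cross the zone boundary
iptables -A FORWARD -j LOG --log-prefix "OT-ZONE-DENY: "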

In a system designed like this, the maintenance server would not be Internet exposed. The Harvester would have to go through a lot of hoops to land on it with the exploit. Assuming the threat actor manages to do that through social engineering and multiple hops of lateral movement, it would still be very difficult to move on from there:

  • The application isn’t running as root anymore – only non-privileged access
  • The server is likely placed in its own information management zone, receiving data through a proxy or some push/pull data historian system. Lateral movement will be blocked on the firewall, or at least require a hard-to-configure bypass.
  • The engineering workstations are isolated and not reachable over the network without a change request for firewall rules. Getting to the place where settings and logic can be changed becomes difficult.
  • The PLCs are configured to not be remotely programmable without a physical change (like a physical key controlling the update mode).

Using Sarah’s plan, the next time The Harvester comes along, the bad guy is turned away at the door, or gets locked into the janitor’s closet. The diagnostic paper is getting shipped.

Key take-aways:

  1. Exposing critical systems directly on the Internet is not a good idea, unless it is meant to be a web service engineered for that type of hostile environment
  2. Zero-trust in OT systems is possible, and is a good strategy to defend against zero-days.
  3. “Defenders must be right all the time, hackers only need to get lucky once” is a lie – if you implement a good security architecture, lucky once just means locked into the janitor’s closet.

Secure Multihomed Devices in Control Networks: Managing Risks and Enhancing Resilience

In control networks, where ensuring constant communication and reliable operation is critical, devices are frequently configured to be multihomed. This means they possess connections to multiple separate networks. This approach is favored over traditional routing methods where traffic is passed between networks. The advantage lies in the redundancy and potential performance boost multihoming offers. If one connection malfunctions, the device can seamlessly switch to another, maintaining vital communication within the control network. Additionally, multihoming allows for the possibility of utilizing different networks for specific traffic types, potentially optimizing overall control network performance.

While multihoming offers redundancy and performance benefits in control networks, it introduces security risks if the connected networks are meant to be entirely separate. Here’s why:

  1. Bridging Separate Networks: A multihomed device acts like a bridge between the networks it’s connected to. If these networks should be isolated for security reasons (e.g., a control signal network and a configuration network), the multihomed device can unintentionally create a pathway for unauthorized access. A malicious actor on one network could potentially exploit vulnerabilities on the device to gain access to the otherwise isolated network.
  2. Policy Bypass: Firewalls and other security measures are typically implemented at network borders to control traffic flow. With a multihomed device, traffic can potentially bypass these security controls altogether. This is because the device itself can become a point of entry, allowing unauthorized traffic or data to flow between the networks, even if the network firewalls have proper rules in place.
  3. Increased Attack Surface: Each additional connection point represents a potential vulnerability. With a multihomed device, attackers have more opportunities to exploit weaknesses in the device’s security or configuration to infiltrate one or both networks.

Bypassing firewalls: an example

Consider a system with two networks, where traffic is routed through a firewall. Network B is considered critical for real-time operations and has primarily control system protocols such as Modbus. This network is not encrypted. Network A is primarily used for configuring systems and reprogramming controllers. Most of the traffic is encrypted. Remote access is accepted into Network A, but not Network B.

On the firewall, all traffic between A and B is blocked during normal operation. When a controller in network B needs to be updated, a temporary firewall rule to allow the traffic is added.

Computer 2 is multi-homed and can be used to bypass the firewall

Along comes the adversary, who manages to use remote access to compromise Computer 1 and take over a local administrator account. The attacker then moves laterally to Computer 2 using the Network A interface, managing to secure an SSH shell on Computer 2. From this shell, the attacker now has access to the control network over the second network interface, and executes a network scan from Computer 2 to identify the devices in Network B. Moving on from there, the attacker is able to manipulate devices and network traffic to cause physical disruption, and the plant shuts down.

What are the options?

Your options to reduce the risk from multihomed devices may be limited, but keeping it like the example above is definitely risky.

  • The ideal solution: Remove any multi-homed setups, and route all traffic through the firewall. This way you have full control of what traffic is allowed. This may not be possible if the added latency is too high, but that is a rare constraint.
  • The micro-segmented solution: Keep the network interfaces but add stateless firewalls on each network card to limit the traffic. Then the multi-homed device becomes its own network segment. Using this to implement a default deny policy will greatly improve the security of the solution (see the sketch after this list).
  • Device hardening: This should be done for all the solutions, but can also be a solution in its own right. Keep the multi-homed behavior in place, but harden the device so that taking it over becomes really difficult. Disable all unused services, run all applications with minimal privileges, and use the host-based firewall to limit the traffic allowed (both ingress and egress).
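
A rough sketch of the micro-segmented option on the multi-homed device itself is shown below. Interface names, addresses and ports are hypothetical; the point is that the device only accepts the flows that have been documented, on each of its network cards, and never forwards packets between them.

# On the multi-homed device (Computer 2): eth0 faces Network A, eth1 faces Network B

# Never route packets between the two interfaces
sysctl -w net.ipv4.ip_forward=0

# Default deny inbound and outbound, allow loopback and established connections
iptables -P INPUT DROP
iptables -P OUTPUT DROP
iptables -A INPUT -i lo -j ACCEPT
iptables -A OUTPUT -o lo -j ACCEPT
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
iptables -A OUTPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

# Allow only documented flows:
# - management over SSH from one jump host in Network A
iptables -A INPUT -i eth0 -p tcp -s 10.0.1.5 --dport 22 -j ACCEPT
# - Modbus polling of one controller in Network B
iptables -A OUTPUT -o eth1 -p tcp -d 10.0.2.30 --dport 502 -j ACCEPT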

Teaching smart things cyber self defense: ships and cars that fight back

We have connected everything to the Internet – from power plants to washing machines, from watches with fitness trackers to self-driving cars and even self-driving gigantic ships. At the same time, we struggle to defend our IT systems from criminals and spies. Every day we read about data breaches and cyber attacks. Why are we then not more afraid of cyber attacks on the physical things we have put into our networks?

  • Autonomous cars getting hacked – what if they crash into someone?
  • Autonomous ships getting hacked – what if they lose stability and sink due to a cyber attack or run over a small fishing boat?
  • Autonomous light rail systems – what if they are derailed at high speed due to a cyber attack?

Luckily, we are not reading about things like this in the news, at least not very often. There have been some mentions of car hacking, usually demonstrations of what is possible. But when we build more and more of these smart systems that can cause disasters if control is lost, shouldn’t we think more about security when we build and operate them? Perhaps you think that someone must surely be taking care of that. But the fact is that, in many cases, it isn’t really handled very well.

How can an autonomous vessel defend against cyber attacks?

What is the attack surface of an autonomous system?

The attack surface of an autonomous system may of course vary, but they tend to have some things in common:

  • They have sensors and actuators communicating with a “brain” to make decisions about the environment they operate in
  • They have some form of remote access and support from a (mostly) human operated control center
  • They require systems, software and parts from a high number of vendors with varying degrees of security maturity

For the sake of the example, let us consider an autonomous battery-powered vessel at sea, such as a ferry. Such a vessel will have multiple operating modes:

  • Docking to the quay
  • Undocking from the quay
  • Loading and offloading at quay
  • Journey at sea
  • Autonomous safety maneuvers (collision avoidance)
  • Autonomous support systems (bilge, ballast, etc)

In addition there will typically be a number of operations that are to some degree human-led, such as search and rescue if there is a man overboard situation, firefighting, and other operations, depending on the operating concept.

To support the operations required in the different modes, the vessel will need an autonomous bridge system, an engine room able to operate without a human engineer in place to maintain propulsion, and various support systems for charging, mooring, cargo handling, etc. This will require a number of IT components in place:

  • Redundant connectivity with sufficient bandwidth (5G, satellite)
  • Local networking
  • Servers to run the required software for the ship to operate
  • Sensors to ensure good situational awareness for the ship’s autonomous system (and for the human onshore operators in the support center)

The attack surface is likely to be quite large, including a number of suppliers, remote access systems, people and systems in the remote control center, and remote software services that may run in private data centers, or in the cloud. The main keyword here is: complexity.

Defending against cyber piracy on the high seas

With normal operation of the vessel, its propulsion and bridge systems would not depend on external connectivity. Although cyber attacks can also hit conventional vessels, much of the damage can be counteracted by seafarers onboard taking more manual control of the systems and turning off much of the “smartness”. With autonomous systems this is not always an option, although there are degrees of autonomy and it is possible to use similar mechanisms if the systems are semi-autonomous with people present to take over in case of unusual situations. Let’s assume the systems are fully autonomous and there is nobody onboard to take control of them.

Since there are no people to compensate for digital systems working against us, we need to teach the digital systems to defend themselves. We can apply the same structural approach to securing autonomous systems, as we do to other IT and OT systems; but we cannot rely on risk reduction from human local intervention. If we follow “NSM’s Grunnprinsipper for IKT-sikkerhet” (the Norwegian government’s recommendations for IT security, very similar to NIST’s cybersecurity framework), we have the following key phases:

  1. Identify: know what you have and the security posture of your system
  2. Protect: harden your systems and use security technologies to stop attackers
  3. Detect: set up systems so that cyber attacks can be detected
  4. Respond: respond to contain compromised systems, evict intruders, recover capabilities, improve hardening and return to normal operations

These systems are also operational technology (OT). It may therefore be useful to also refer to IEC 62443 in the analysis of the systems, especially to assess the risk to the system, assign required security levels and define requirements. The IEC 62443 reference architecture is also useful.

It is not the case that every security function has to work completely autonomously in an autonomous system, but the level of automation has to be higher than in a normal OT system, and higher than in most IT systems. Let’s consider what that could mean for a collision avoidance system on an autonomous vessel. The job of the collision avoidance system can be defined as follows:

  1. Detect other vessels and other objects that we are on collision course with
  2. Detect other objects close-by
  3. Choose action to take (turn, stop, reverse, alert on-shore control center, communicate to other vessels over radio, etc)
  4. Execute action
  5. Evaluate effect and make corrections if needed

In order to do this, the ship has a number of sensors to provide the necessary situational awareness. There has been a lot of research into such systems, especially collaborative systems with information exchange between vessels. There have also been pilot developments, such as this one https://www.maritimerobotics.com/news/seasight-situational-awareness-and-collision-avoidance by the Norwegian firm Maritime Robotics.

We consider a simplified view of how the collision avoidance system works. Sensors tell the anti-collision system server what they see. The traffic is transmitted over proprietary protocols, some over TCP, some over UDP (camera feeds). Some of the traffic is not encrypted, but all of it is transferred over the local network. The main system server processes the data onboard the ship and makes decisions. Those decisions go to functions in the autonomous bridge to take action, including sending radio messages to nearby ships or onshore. Data is also transmitted to onshore control via the bridge system. Onshore operators can use a remote connection to connect to the collision avoidance server directly for richer data, as well as for overriding or configuring the system.

Identify

The system should automatically create a complete inventory of its hardware, software, networks, and users. This inventory must be available for automated decision-making about security, but also to the human and AI agents working as security operators onshore.

The system should also automatically keep track of all temporary exceptions and changes, as well as any known vulnerabilities in the system.

In other words: a real-time security posture management system must be put in place.
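
As a simple illustration of what that posture data collection could look like on a Linux-based server onboard, the sketch below gathers an inventory snapshot and bundles it for the onshore posture database. The paths and the Debian-style package query are assumptions; a real implementation would be an agent that runs continuously and feeds a central asset and vulnerability database.

#!/bin/sh
# Collect a machine-readable posture snapshot from this host
HOST=$(hostname)
STAMP=$(date -u +%Y%m%dT%H%M%SZ)
OUT=/var/lib/posture/${HOST}-${STAMP}
mkdir -p "$OUT"

# Installed software and versions (Debian/Ubuntu example)
dpkg-query -W -f '${Package} ${Version}\n' > "$OUT/packages.txt"
# Listening network services
ss -tulpn > "$OUT/listening.txt"
# Local user accounts
getent passwd > "$OUT/users.txt"
# Active host firewall rules
iptables-save > "$OUT/firewall.txt"

# Bundle the snapshot for transfer to the onshore posture database
tar czf "$OUT.tar.gz" -C /var/lib/posture "${HOST}-${STAMP}"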

Protect

An attacker may wish to perform different types of actions on this vessel. Since we are only looking at the collision avoidance system here we only consider an adversary that wants to cause an accident. Using a kill-chain approach to our analysis, the threat actor thus has the following tasks to complete:

  • Recon: get an overview of the attack surface
  • Weaponization: create or obtain payloads suitable for the target system
  • Delivery: deliver the payloads to the systems. Here the adversary may find weaknesses in remote access, perform a supply-chain attack to deliver a flawed update, use an insider to gain access, or compromise an on-shore operator with remote access privileges.
  • Execution: for a purely technical attack, automated execution will be necessary. For human-based attacks, operator-executed commands will likely be how the malware is run.
  • Installation: valid accounts on systems, malware running on Windows server
  • Command and control: use internet connection to remotely control the system using installed malware
  • Actions on objectives: reconfigure sensors or collision avoidance system by changing parameters, uploading changed software versions, or turning the system off

If we want to protect against this, we should harden our systems as much as possible.

  • All access should require MFA
  • Segregate networks as much as possible
  • Use least privilege as far as possible (run software as non-privileged users)
  • Write-protect all sensors
  • Run up-to-date security technologies that block known malware (firewalls, antivirus, etc)
  • Run only pre-approved and signed code, block everything else
  • Remove all unused software from all systems, and disable built-in functionality that is not needed
  • Block all non-approved protocols and links on the firewall
  • Block internet egress from endpoints, and only make exceptions for what is needed

Doing this will make it very hard to compromise the system using regular malware, unless operations are run as an administrator who can change the hardening rules. It will most likely protect against most malware run by an administrator too, if the threat actor has not anticipated the hardening steps. Blocking traffic both on the main firewall and on host-based firewalls makes it unlikely that the threat actor will be able to remove both security controls.
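
As a sketch of the least-privilege point, the collision avoidance software could run as a dedicated, non-privileged service account under a locked-down systemd unit. The service name, user and paths below are hypothetical.

# Create a dedicated, non-privileged account and data directory for the service
useradd --system --no-create-home --shell /usr/sbin/nologin colavoid
mkdir -p /var/lib/colavoid && chown colavoid: /var/lib/colavoid

# Locked-down systemd unit (hypothetical paths)
cat > /etc/systemd/system/colavoid.service <<'EOF'
[Unit]
Description=Collision avoidance decision service (sketch)

[Service]
User=colavoid
ExecStart=/opt/colavoid/bin/colavoid --config /etc/colavoid/config.yaml
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/lib/colavoid

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now colavoid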

Detect

If an attacker manages to break into the anti-collision system on our vessel, we need to be able to detect this fast, and to respond to it. The autonomous system should ideally perform the detection on its own, without the need for a human analyst, due to the need for fast response. Using humans (or AI agents) onshore in addition is also a good idea. As a minimum the system should:

  • Log all access requests and authorization requests
  • Apply UEBA (user and entity behavior analytics) to detect unusual activity
  • Use advanced detection technologies such as security features of a NGFW, a SIEM with robust detection rules, thorough audit logging on all network equipment and endpoints
  • Use EDR technology to provide improved endpoint visibility
  • Receive and use threat intelligence in relevant technologies
  • Use deep packet inspection systems with protocol interpreters for any OT systems part of the anti-collision system
  • Map threat models to detection coverage to ensure anticipated attacks are detectable

By using a comprehensive detection approach to cyber events, combined with a well-hardened system, it will be very difficult for a threat actor to take control of the system unnoticed.
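
As one small, concrete example of the audit logging and detection points above, Linux audit rules could watch the collision avoidance configuration and binaries for tampering and record everything executed as root. The paths and key names are hypothetical; the resulting events would be forwarded to the SIEM.

# Watch configuration and binaries of the collision avoidance system for changes
auditctl -w /etc/colavoid/ -p wa -k colavoid-config
auditctl -w /opt/colavoid/bin/ -p wa -k colavoid-binaries

# Record all commands executed with root privileges
auditctl -a always,exit -F arch=b64 -S execve -F euid=0 -k root-exec

# Example query from a periodic detection job (any hits should raise an alert)
ausearch -k colavoid-config --start today
ausearch -k root-exec --start today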

Respond and recover

If an attack is detected, it should be dealt with before it can cause any damage. For a safety-critical system, it may be a good idea to plan conservatively for a physical response to a cybersecurity intrusion detection on an autonomous ship, even if the detection is not 100% reliable. A possible response could be:

  • Isolate the collision avoidance system from the local network automatically (see the sketch after this list)
  • Stop the vessel and maintain position (using dynamic positioning if it is available and not itself subject to security detections, with dropping anchor as a backup)
  • Alert nearby ships over radio that “Autonomous ship has lost anti-collision system availability and is maintaining position. Please keep distance. “
  • Alert onshore control of the situation.
  • Run system recovery
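
A minimal sketch of what the first, automated response steps could look like is shown below, assumed to run on an onboard security automation host rather than on the compromised server itself. Host names, addresses and APIs are hypothetical.

#!/bin/sh
# Triggered when the detection system flags the collision avoidance server
CA_SERVER=10.20.1.50   # hypothetical address of the collision avoidance server

# 1. Isolate the collision avoidance server at the onboard firewall
ssh fw01 "iptables -I FORWARD 1 -s $CA_SERVER -j DROP; iptables -I FORWARD 1 -d $CA_SERVER -j DROP"

# 2. Ask the autonomous bridge to stop and hold position (hypothetical API)
curl -s -X POST https://bridge.local/api/hold-position -d '{"reason":"collision avoidance isolated"}'

# 3. Alert onshore control over the redundant link (hypothetical endpoint)
curl -s -X POST https://onshore-control.example/api/alerts -d '{"vessel":"ferry-01","event":"collision avoidance isolated, holding position"}'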

System recovery could entail securing forensic data, automatically analysing the data for indicators of compromise to identify patient zero and the exploitation path, expanding the blast radius analysis through any pivots, reinstalling all affected systems from trusted backups, updating configurations and hardening against the exploitation path where possible, performing system validation, and transferring back to operations with approval from onshore operations. Establishing a response system like this would require considerable engineering effort.

An alternative approach is to maintain position, and wait for humans to manually recover the system and approve returning to normal operation.

The development of autonomous ships, cars and other high-risk applications is subject to regulatory approval. Yet, the focus of authorities may not be on cybersecurity, and the competence of those designing the systems, as well as of those approving them, may be stronger in other areas than cyber. This is especially true for sectors where cybersecurity has not traditionally been a big priority due to more manual operations.

A cyber risk recipe for people developing autonomous cyber-physical systems

If we are going to make a recipe for development of responsible autonomous systems, we can summarize this in 5 main steps:

  • Maintain good cyber situational awareness. Know what you have in your systems, how it works, and where you are vulnerable – and also keep track of the adversary’s intentions and capabilities. Use this to plan your system designs and operations. Adapt as the situation changes.
  • Rely on good practice. Use IEC 62443 and other known IT/OT security practices to guide both design and operation.
  • Involve the suppliers and collaborate on defending the systems, from design to operations. We only win through joint efforts.
  • Test continuously. Test your assumptions, your systems, your attack surface. Update defenses and capabilities accordingly.
  • Consider changing operating mode based on threat level. With good situational awareness you can take precautions when the threat level is high by reducing connectivity to a minimum, moving to lower degree of autonomy, etc. Plan for high-threat situations, and you will be better equipped to meet challenging times.

Avoid keeping sensitive info in a code repo – how to remove files from git version history

One of the vulnerabilities that is really easy to exploit is when people leave super-sensitive information in source code – and you get your hands on this source code. In early prototyping, a lot of people will hardcode passwords and certificate keys in their code and remove them later when moving to production code. Sometimes they are not even removed from production. But even in the case where you do remove it, this sensitive information can linger in your version history. What if your app is an open source app where you are sharing the code on GitHub? You probably don’t want to share your passwords…

Don’t let bad guys get the key to your databases and other valuable files by searching old versions of your code in the repository.

Getting this sensitive info out of your repository is not as easy as deleting the file from the repo and adding it to the .gitignore file – because this does not touch your version history. What you need to do is this:

  • Merge any remote changes into your local repo, to make sure you don’t remove the work of your team if they have committed after your own last merge/commit
  • Remove the file history for your sensitive files from your local repo using the filter-branch command:

git filter-branch --force --index-filter \
'git rm --cached --ignore-unmatch PATH-TO-YOUR-FILE-WITH-SENSITIVE-DATA' \
--prune-empty --tag-name-filter cat -- --all

Although the command above looks somewhat scary, it is not that hard to dig out – you can find it in the GitHub docs. When that’s done, there are only a few more things to do:

  • Add the files in question to your .gitignore file
  • Force-push the rewritten history to the remote (git push origin --force --all)
  • Tell all your collaborators to clone the repo as a fresh start, to avoid them merging the sensitive files back in again (the full sequence is sketched below)
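
Put together, the cleanup could look something like the sequence below, assuming the sensitive file is called config/secrets.json (a made-up path for illustration):

# After running the filter-branch command above:
echo "config/secrets.json" >> .gitignore
git add .gitignore
git commit -m "Stop tracking the secrets file"

# Overwrite the remote history, including tags
git push origin --force --all
git push origin --force --tags

# Finally: rotate every password and key that was in the file – it is compromised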

Also, if you have actually pushed sensitive info to a remote repository, particularly if it is an open source publicly available one, make sure you change all passwords and certificates that were included previously – this info should be considered compromised.



Integrating power grids: what does it do to cyber resilience?

There are two big trends in the power utilities business today – with opposing signs:

  • Addition of micro-producers and microgrids, making consumers less bound to the large grid operators
  • Increasing integration of power grids over large distances, allowing mega-powerplants to serve enormous areas

Both trends will have impact on grid resilience; the microgrids are usually connected to regional grids in order to sell surplus power, and the mega plants obviously require large grid investments as well. When we seek to understand the effect on resilience we need to examine two types of events:

  • Large-scale random events threatening the regularity of the power transmission capability
  • Large-scale attacks by SCADA hackers that knock out production and transmission capacity over extended areas

We will not perform a structured risk assessment here but we will rather look at some possible effects of these trends when it comes to power regularity and (national?) security.

Infographic from statkraft.com about Fosen Vind – Europe’s largest onshore wind project (1000 MW)

Recent events that are interesting to know about

Mega-plants and increasing grid integration

Power plants are in the wind, literally speaking. The push for renewables to come to the market is driving concrete large-scale investments, and several interesting projects, such as the Fosen Vind development shown in the infographic above, are currently moving ahead.

In addition to this, we see that NERC, the American organization responsible for the reliability of the power grids in the United States, Canada and parts of Mexico, is working to include Mexico as a full member. This will very likely lead to increased integration of the power transmission capacities across the U.S.-Mexico border, at least at the organizational and grid management levels.

Random faults and large-scale network effects

What happens to the transmission capacity when random faults occur? This depends on the redundancy built into the network, and the capacities of the remaining lines when one or more paths fail. As more of the energy mix moves towards renewables we are going to be even more dependent on a reliable transmission grid; renewable energy is hard to store, and the cost of high-capacity storage will add to the energy price, making renewable sources less competitive compared with fossil fuels.

If we start relying on mega plants, this is also going to make us depend more on a reliable grid. The network effects would have to be investigated using methods like Monte Carlo simulations (RAM analysis) but what we should expect is:

  • Mega plants will require redundancy in intercontinental grid connections to avoid blackouts if one route is down
  • Areas without access to base load energy supply would be more vulnerable than those that can supply their own energy locally
  • Prices will fluctuate over larger areas when energy production is centralized
  • Micro-grids and micro-production should alleviate some of the increased vulnerability for small consumers (like private households) but are unlikely to be an effective buffer for industrial consumers

Coordinated cyber warfare campaigns

Recent international events have brought cyber warfare to the forefront of politics. Recently it was suggested at the RSA conference that deterrence through information sharing and openness does not work, and that since we are not able to deny intrusions by state-sponsored hackers, we need to respond in force to such attacks, including armed military response in the physical world.

Recent cyberattacks in this domain have been reported from conflict zones. The reports receiving the most attention in the media are those coming out of Ukraine, where the authorities have accused Russia of being responsible for a series of cyber-attacks, including the one causing a major blackout in parts of Ukraine in December 2015. For a nice summary of the Ukrainian situation, see this post on the cybersecurity blog from SANS.

Increasing cooperation across national borders can increase our resilience, but at the same time it will make the effects of attacks spread to larger regions. Depending on the security architecture of the network as a whole, attackers could be capable of compromising entire continents, potentially damaging the defense capabilities of those countries severely as population morale is hit by the loss of critical infrastructure.

What should we do now?

There are many positive outcomes of increased integration and very large renewable energy producers – but we should not disregard risks, including the political ones. When building such plants and the grids necessary to serve customers we need to ensure sufficient redundancy exists to cope with partial fallouts in a reasonable manner. We should also build our grids such that we have a robust security architecture, with auditable rules to ensure security management is on par across borders. This is the strength of NERC. Cyber resilience considerations should be made also for other parts of the world. Perhaps it is time to lay the groundwork for international conventions on grid reliability and security before we end up connecting all our continents to the same electrical network.

Do SCADA vulnerabilities matter?

Sometimes we talk to people who are responsible for operating distributed control systems. These are sometimes linked up to remote access solutions for a variety of reasons. Still, the same people often do not understand that vulnerabilities are still being found in mature systems, and they often fail to take the typically simple actions needed to safeguard their systems.

For example, a new vulnerability was recently discovered for the Siemens Simatic CP 343-1 family. Siemens has published a description of the vulnerability, together with a firmware update to fix the problem: see Siemens.com for details.

So, are there any CP 343’s facing the internet? A quick trip to Shodan shows that, yes, indeed, there are lots of them. Everywhere, more or less.

Screenshot: Shodan search results for internet-exposed Simatic CP 343 devices
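
If you want to reproduce this kind of search yourself, the Shodan command-line client works well. The queries below are illustrative rather than precise fingerprints for the CP 343-1 (port 102 is the ISO-TSAP port used by Siemens S7 communication):

# Install and initialise the Shodan CLI with your API key
pip install shodan
shodan init YOUR_API_KEY

# Count internet-facing devices answering on the S7/ISO-TSAP port
shodan count "port:102 siemens"

# List a sample with address, port and owning organisation
shodan search --fields ip_str,port,org "Simatic CP 343-1"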

Now, if you did have a look at the Siemens site, you will see that the patch was available from the release date of the vulnerability advisory, 27 November 2015. What, then, is the average update time for patches in a control system environment? There are no Patch Tuesdays. In practice, such systems are patched somewhere between monthly and never, with a bias towards never. That means that the bad guys have lots of opportunities to exploit your systems before a patch is deployed.

This simple example reinforces that we should stick to the basics:

  • Know the threat landscape and your barriers
  • Use architectures that protect your vulnerable systems
  • Do not use remote access where it is not needed
  • Reward good security behaviors and sanction bad attitudes among employees
  • Create a risk mitigation plan based on the threat landscape – and stick to it in practice too

 

New security requirements to safety instrumented systems in IEC 61511

IEC 61511 is undergoing revision, and one of the more welcome changes is the inclusion of cyber security clauses. According to a presentation held by functional safety expert Dr. Angela Summers at the Mary Kay Instrument Symposium in January 2015, the following clauses are now included in the new draft – the standard is planned to be issued in 2016:

  • Clause 8.2.4: Description of identified [security] threats for determination of requirements for additional risk reduction. There shall also be a description of measures taken to reduce or remove the hazards.
  • Clause 11.2.12: The SIS design shall provide the necessary resilience against the identified security risks

What does this mean for asset owners? It obviously makes it a requirement to perform a cyber security risk assessment for the safety instrumented systems (SIS). Such information asset risk assessments should, of course, be performed in any case for automation and safety systems. This, however, makes it necessary to keep security under control to obtain compliance with IEC 61511 – something that is often overlooked today, as described in this previous post. Further, when performing a security study, it is important that human factors and organizational factors are also taken into account – a good technical perimeter defense does not help if the users are not up to the task and do not have sufficient awareness of the security problem.

With respect to the organizational context, the new Clause 11.2.12 is particularly interesting, as it will require security awareness and organizational resilience planning to be integrated into the functional safety management planning. As noted by many others, we have seen a sharp rise in attacks on SCADA systems over the past few years – these security requirements will bring the reliability and security fields together and ensure better overall risk management for important industrial assets. These benefits, however, will only be achieved if practitioners take the full weight of the new requirements on board.

What is the difference between software and hardware failures in a reliability context?

Reliability engineers have traditionally focused more on hardware than software. There are many reasons for this; one reason is that traditionally safety systems have been based on analog electronics, and although digital controls and PLCs were introduced throughout the 1990s, the actual software involved was in the beginning very simple. Today the situation has really changed, but the reliability field has not completely taken this onboard. One of the reasons may be that reliability experts like to calculate probabilities – which they are very good at doing for hardware failures. Hardware failures tend to be random and can be modeled quite well using probabilistic tools. So – what about software? The failure mechanisms are very different: whereas failures in hardware are related to more or less stochastic effects stemming from load cycling, material defects and ageing, software defects are completely deterministic (we disregard stochastic algorithms here – they are banned from use in safety-critical control systems anyway).

Software defects exist for two reasons: design errors (flaws) and implementation errors (bugs). These errors may occur at the requirement stage or during actual coding, but irrespective of when they occur, they are always static. They do not suddenly appear – they are latent errors hidden within the code that will activate each and every time the software state where the error is relevant is visited.

Such errors are very difficult to include in a probabilistic model. That is why reliability standards prescribe a completely different medicine; a process oriented framework that gives requirements to management, choice of methods and tools, as well as testing and documentation. These quality directed workflows and requirements are put in place such that we should have some confidence in the software not being a significant source of unsafe failures of the critical control system.

Hence – process verification and auditing take the place of probability calculations when we look at the software. In order to achieve the desired level of trust it is very important that these practices are not neglected in the functional safety work. Deterministic errors may be just as catastrophic as random ones – and therefore they must be managed with just as much rigor and care. The current trend is that more and more functionality is moved from hardware to software – which means that software errors are becoming increasingly important to manage correctly if we are not going to degrade both performance and trust of the safety instrumented systems we rely on to protect our lives, assets and the environment.

Does safety engineering require security engineering?

Safety-critical control systems are developed with respect to reliability requirements, often following a reliability standard such as IEC 61508 or CENELEC EN 50128. These standards put requirements on development practices and activities with regard to creating software that works the way it is intended based on the expected input, and where availability and integrity are of paramount importance. However, these standards do not address information security. Some of the practices required by reliability standards do help in removing bugs and design flaws – which to a large extent also removes security vulnerabilities – but they do not explicitly express such concerns. Reliability engineering is about building trust in the intended functionality of the system. Security is about the lack of unintended functionality.

Consider a typical safety-critical system installed in an industrial process, such as an overpressure protection system. Such a system may consist of a pressure transmitter, a logic unit (i.e. a computer) and some final elements. This simple system measures the pressure and transmits it to the computer, typically over a hardwired analog connection. The computer then decides if the system is within a safe operating region, or above a set point for stopping operation. If we are in the unsafe region, the computer tells the final element to trip the process, for example by flipping an electrical circuit breaker or closing a valve. Reliability standards that include software development requirements focus on how development is going to work in order to ensure that whenever the sensor transmits a pressure above the threshold, the computer will tell the process to stop. Further, the computer is connected over a network to an engineering station which is used for such things as updating the algorithm in the control system, changing the threshold limits, etc.

What if someone wants to put the system out of order, without anyone noticing? The software’s access control would be a crucial barrier against anyone tampering with the functionality. Reliability standards do not talk about how to actually avoid weak authentication schemes, although they talk about access management in general. You may very well be compliant with the reliability standard – yet have very weak protection against compromise of the access control. For example, the coder may very well use a “getuser()” call in C in the authentication part of the software – without violating the reliability standard requirements. This is a very insecure way of getting user credentials from the computer and should generally be avoided. If such a practice is used, a hacker with access to the network could with relative ease get admin access to the system and change, for example, set points, or worse, recalibrate the pressure sensor to report wrong readings – something that was actually done in the Stuxnet case.

In other words – as long as someone can be interested in harming your operation – your safety system needs security built-in, and that is not coming for free through reliability engineering. And there is always someone out to get you – for sports, for money or just because they do not like you. Managing security is an important part of managing your business risk – so do not neglect this issue while worrying only about reliability of intended functionality.