Zero-Day OT Nightmare? How Zero-Trust Can Stop APT Attacks

It was a crisp summer Monday, and Alex, the maintenance engineer at Pulp Friction Paper Company, arrived with his coffee, ready to tackle the day. He reveled in the production regularity achieved thanks to his recently implemented smart maintenance program. This program used machine learning to anticipate condition degradation in machinery, a significant improvement over the facility’s previous reliance on traditional periodic maintenance or the ineffective risk-based approaches.

Alex, a veteran at Pulp Friction, had witnessed the past struggles. Previously, paper products were frequently rejected due to inconsistencies in humidity control, uneven drying, or even mechanical ruptures. He was a firm believer in leveraging modern technology, specifically AI, to optimize factory operations. While not a cybersecurity expert, his awareness wasn’t limited to just using technology. He’d read about the concerning OT (Operational Technology) attacks in Denmark last year, highlighting the inherent risks of interconnected systems.

As a seasoned maintenance professional, Alex understood the importance of anticipating breakdowns for effective mitigation. He empathized with the security team’s constant vigilance against zero-day attacks – those unpredictable, catastrophic failures that could turn a smooth operation into a major incident overnight.

Doctors analysing micro blood samples on advanced chromatographic paper from Pulp Friction Paper Mills

Dark clouds over Pulp Friction

A digital phantom stalked the web. “The Harvester,” a notorious APT (Advanced Persistent Threat) group known for targeting high-value assets, had Pulp Friction Paper Company in their crosshairs. Their prize? Not paper, but a revolutionary innovation: medical diagnostic paper. Pulp Friction had recently begun producing these specialized sheets, embedded with advanced materials, for use in chromatographic tests. This cutting-edge technology promised rapid diagnosis of a multitude of diseases from mere microliter blood samples, a potential game-changer in the medical field. Unbeknownst to Alex, a gaping zero-day vulnerability resided within the facility’s industrial control system (ICS) software. If exploited, The Harvester could wreak havoc, disrupting production of these life-saving diagnostic tools and potentially delaying critical medical care for countless individuals. The stakes had just been raised. Could Alex, with his limited cybersecurity awareness, and the current defenses, thwart this invisible threat and ensure the smooth flow of this vital medical technology?

A wave of unease washed over Alex as he stared at the malfunctioning control panel. The usually predictable hum of the paper production line had been replaced by a cacophony of alarms and erratic readings. Panic gnawed at him as vital indicators for the chromatographic test paper production process lurched erratically. This wasn’t a typical equipment malfunction – it felt deliberate, almost malicious.

Just then, a memory flickered in Alex’s mind. Sarah, the friendly and highly skilled network security specialist he occasionally consulted with, had been pushing for a new security system called “zero-trust.” While Alex appreciated Sarah’s expertise, he hadn’t quite understood the nuances of the system or its potential benefits. He’d brushed it off as an extra layer of complexity for an already demanding job.

Now, regret gnawed at him alongside the growing sense of dread. Grabbing his phone, Alex dialed Sarah’s number, his voice laced with a tremor as he blurted out, “Sarah, something’s terribly wrong with the ICS system! The readings are all messed up, and I don’t know what’s happening!” The urgency in his voice was impossible to miss, and Sarah, sensing the dire situation, promised to be there as soon as possible. With a heavy heart, Alex hung up, the echo of his own ignorance a stark reminder of the consequences he might have inadvertently unleashed by ignoring the recommendations on network security improvements.

The Harvester: a technical intermezzo

Diagram made with mermaid.live

The Harvester is capable of zero-day research and exploit development. In this attack, they are targeting companies that use advanced technologies to supply healthcare providers – and many of those companies use innovative maintenance systems.

They first find a web-exposed server used by the AI-driven maintenance system. The system is Internet-exposed due to the frequent need for access by multiple vendors. By exploiting the vulnerability there, they gain root access to the underlying Linux operating system. The Harvester, like many other threat actors, then installs a web shell for convenient persistent access, and continues to move on using conventional techniques. Reaching the engineering workstation, the attacker is able to reprogram PLCs and disable safety features. Having achieved this, the system is no longer a highly reliable production line for diagnostic test paper: it is a beeping mess, spilling water, breaking paper lines and leaving damage that is difficult to fix.

They continue by posing as Pulp Friction employees, leaking CCTV footage of the chaos on the factory floor showing panicking employees running around. They also post on social media claiming that Pulp Friction never cared about reliability or security, that money was the only goal without any regard for patient safety – and that this company should never be allowed to supply anything to hospitals or care providers!

What it took to get back to business

Sarah arrived at Pulp Friction, a whirlwind of focused energy. Immediately, she connected with Alex and reviewed the abnormal system behavior. Her sharp eyes landed on the internet access logs for the smart maintenance system – a system Alex had mentioned implementing. Bingo! This web-exposed system, likely the initial point of entry, was wide open to the internet. Without hesitation, Sarah instructed the IT team to isolate and disable the internet access for the maintenance system – a crucial first step in stemming the bleeding.

“The only thing necessary for the triumph of evil is for good men to do nothing.”

Edmund Burke

Cybersecurity meaning: “Don’t be a sitting duck the day the zero-day is discovered.”

Next, she initiated the full incident response protocol, securing compromised systems, isolating affected network segments, and reaching out to both the Pulp Friction IT team and external forensics experts. The following 48 hours were a blur – a symphony of collaboration. Sarah led the incident response, directing forensics on evidence collection and containment, while the IT team worked feverishly to restore services and patch vulnerabilities.

Exhausted but resolute, Sarah and Alex presented their findings to the CEO. Witnessing the team’s dedication and the potential consequences, the CEO asked for a plan that would put Pulp Friction on the path to a more robust defense.

With the immediate crisis averted, Sarah knew a stronger defense was needed. She turned to Alex, his eyes reflecting a newfound appreciation for cybersecurity. “Remember zero-trust, Alex? The system I’ve been recommending?” Alex nodded, his earlier skepticism replaced by a desire to understand.

“Think of it like guarding a high-security building,” Sarah began. “No one gets in automatically, not even the janitor. Everyone, from the CEO to the maintenance crew, has to show proper ID and get verified every time they enter.”

Alex’s eyes lit up. “So, even if someone snuck in through a hidden door (like the zero-day), they wouldn’t have access to everything?”

“Exactly!” Sarah confirmed. “Zero-trust constantly checks everyone’s access, isolating any compromised systems. Imagine the attacker getting stuck in the janitor’s closet, unable to reach the control room.”

Alex leaned back, a relieved smile spreading across his face. “So, with zero-trust, even if they got in through that maintenance system, they wouldn’t be able to mess with the paper production?”

“Precisely,” Sarah said. “Zero-trust would limit their access to the compromised system itself, preventing them from reaching critical control systems or causing widespread disruption.”

With the analogy clicking, Alex was on board. Together, Sarah and Alex presented the zero-trust solution to the CEO, emphasizing not only the recent attack but also the potential future savings and improved operational efficiency. Impressed by their teamwork and Sarah’s clear explanation, the CEO readily approved the implementation of zero-trust and segmentation within the OT network.

Pulp Friction, once vulnerable, was now on the path to a fortress-like defense. The zero-day vulnerability might have been a wake-up call, but with Sarah’s expertise and Alex’s newfound understanding, they had turned a potential disaster into a catalyst for a much stronger security posture. As production hummed back to life, creating the life-saving diagnostic paper, a sense of quiet satisfaction settled in. They couldn’t erase the attack, but they had ensured it wouldn’t happen again.

How Alex and Sarah collaborated to achieve zero-trust benefits in the OT network

Zero-trust in the IT world relies heavily on identity and endpoint security posture. Both of those concepts can be hard to implement in an OT system. This does not mean that zero-trust concepts have no place in industrial control systems – it just means that we have to play within the constraints of the system.

  • Network segregation is critical. Upgrading from old firewalls to modern firewalls with strong security features is a big win.
  • Use smaller security zones than have traditionally been accepted.
  • For Windows systems in the factory, on Layer 3 and the DMZ (3.5) in the Purdue model, we are primarily dealing with IT systems. Apply strong identity controls, and make systems patchable. The excuse is often that systems cannot be patched because no downtime is allowed, but virtualization and modern resilient architectures allow us to do workload management and zero-downtime patching. We just need to plan for it!
  • For any systems with weak security features, compensate with improved observability.
  • Don’t expose things to the Internet. Secure your edge devices, and use DMZs, VPNs, and privileged access management (PAM) systems with temporary credentials.
  • Don’t run things as root/administrator. You almost never need to.

In a system designed like this, the maintenance server would not be Internet-exposed. The Harvester would have to jump through a lot of hoops to land the exploit on it. Assuming the threat actor manages to do that through social engineering and multiple hops of lateral movement, it would still be very difficult to move on from there:

  • The application isn’t running as root anymore – only non-privileged access.
  • The server is likely placed in its own information management zone, receiving data through a proxy or some push/pull data historian system. Lateral movement will be blocked at the firewall, or at least require a hard-to-configure bypass.
  • The engineering workstations are isolated and not reachable over the network without a change request for firewall rules. Reaching the place where settings and logic can be changed becomes difficult.
  • The PLCs are configured not to be remotely programmable without a physical change (like a physical key controlling the update mode).
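
The segmentation described above boils down to a default-deny policy between zones: a flow is blocked unless it is explicitly allowed. A minimal sketch of that idea, where all zone names, protocols and rules are illustrative assumptions rather than any real product's configuration:

```python
# Hypothetical default-deny zone policy for an OT network.
# Zone names, protocols and allowed flows are illustrative only.

ALLOWED_FLOWS = {
    # (source zone, destination zone): set of allowed protocols
    ("dmz", "maintenance"): {"https"},          # vendor access via PAM
    ("maintenance", "historian"): {"opc-ua"},   # push data to historian
    ("engineering", "control"): {"s7comm"},     # requires change request
}

def is_allowed(src_zone: str, dst_zone: str, protocol: str) -> bool:
    """Default deny: a flow is allowed only if explicitly listed."""
    return protocol in ALLOWED_FLOWS.get((src_zone, dst_zone), set())

# The attacker lands on the maintenance server and tries to reach an
# engineering workstation directly -- blocked by default deny.
print(is_allowed("maintenance", "engineering", "rdp"))   # False
# The legitimate data push to the historian still works.
print(is_allowed("maintenance", "historian", "opc-ua"))  # True
```

The point of the sketch is the empty default: anything not on the list is denied, so a compromised maintenance server cannot open new paths without a firewall change.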

Using Sarah’s plan, the next time The Harvester comes along, the bad guy is turned away at the door, or gets locked into the janitor’s closet. The diagnostic paper is getting shipped.

Key take-aways:

  1. Exposing critical systems directly on the Internet is not a good idea, unless the system is a web service engineered for that type of hostile environment.
  2. Zero-trust in OT systems is possible, and is a good strategy for defending against zero-days.
  3. “Defenders must be right all the time, hackers only need to get lucky once” is a lie – if you implement good security architecture. Lucky once = locked into the janitor’s closet.

Impact of OT attacks: death, environmental disasters and collapsing supply-chains

Securing operational technologies (OT) is different from securing enterprise IT systems. Not because the technologies themselves are so different – but the consequences are. OT systems are used to control all sorts of systems we rely on for modern society to function: oil tankers, high-speed trains, nuclear power plants. What is the worst thing that could happen if hackers take control of the information and communication technology based systems used to operate and safeguard such systems? Obviously, the consequences could be significantly worse than a data leak showing who the customers of an adult dating site are. Death is generally worse than embarrassment.

No electricity could be a consequence of an OT attack.

When people think about cybersecurity, they typically think about confidentiality. IT security professionals will take a more complete view of data security by considering not only confidentiality, but also integrity and availability. For most enterprise IT systems, the consequences of hacking are financial, and sometimes also legal. Think about data breaches involving personal data – we regularly see stories about companies and also government agencies being fined for lack of privacy protections. This kind of thinking is often brought into industrial domains; people doing risk assessments describe consequences in terms such as “unauthorized access to data” or “data could be changed by an unauthorized individual”.

The real consequences we worry about are physical. Can a targeted attack cause a major accident at an industrial plant, leaking poisonous chemicals into the surroundings or starting a huge fire? Can damage to manufacturing equipment disrupt important supply chains, causing shortages of critical goods such as fuel or food? Those are the kinds of consequences we should worry about, and these are the scenarios we need to use when prioritizing risks.

Let’s look at three steps we can take to make cyber risks in the physical world more tangible.

Step 1 – connect the dots in your inventory

Two important tools for cyber defenders of all types are “network topologies” and “asset inventory”. If you do not have that type of visibility in place, you can’t defend your systems. You need to know what you have in order to defend it. A network topology is typically a drawing showing you what your network consists of, like network segments, servers, laptops, switches, and also OT equipment like PLCs (programmable logic controllers), pressure transmitters and HMIs (human-machine interfaces – typically the software used to interact with the sensors and controllers in an industrial plant). Here’s a simple example:

An example of a simplified network topology

A drawing like this would be instantly recognizable to anyone working with IT or OT systems. In addition to this, you would typically want an inventory describing all your hardware systems, as well as all the software running on that hardware. In an environment where things change often, this should be generated dynamically. In OT systems, however, these often exist as static files such as Excel sheets, manually compiled by engineers during system design. Such files are highly likely to be out of date after some time, because they are rarely updated when changes happen.

Performing a risk assessment based on these two descriptions is a common exercise. The problem is that it is very hard to connect this information to the physical consequences we want to safeguard against. We need to know what the “equipment under control” is, and what it is used for. For example, the above network may be used to operate a batch chemical reactor running an exothermic reaction – that is, a reaction that produces heat. Such reactions need cooling; without it the system could overheat, and potentially explode as well if it produces gaseous products. We can’t see that information from the IT-type documentation alone; we need to connect it to the physical world.

Let’s say the system above is controlling a reactor that has a heat-producing reaction. This reactor needs cooling, which is provided by supplying cooling water to a jacket outside the actual reactor vessel. A controller opens and closes a valve based on a temperature measurement in order to maintain a safe temperature. This controller is the “Temperature Control PLC” in the drawing above. Knowing this makes the physical risk visible.
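
One way to “connect the dots” in practice is to make the inventory itself carry a reference from each network asset to the physical equipment it controls and the worst-case consequence of a compromise. A minimal sketch, where the asset names and consequence descriptions are illustrative assumptions based on the reactor example:

```python
# Hypothetical inventory linking IT/OT assets to the physical
# "equipment under control". All entries are illustrative.

from dataclasses import dataclass

@dataclass
class Asset:
    name: str
    asset_type: str                # e.g. "PLC", "HMI", "server"
    equipment_under_control: str   # the physical job this asset performs
    worst_case_consequence: str    # what a compromise could lead to

inventory = [
    Asset("Temperature Control PLC", "PLC",
          "CSTR cooling-water valve",
          "loss of cooling -> runaway reaction -> explosion"),
    Asset("SCADA server", "server",
          "plant-wide supervision",
          "operators blind to process upsets"),
]

def physical_risks(assets):
    """Map each asset to the physical consequence of its compromise."""
    return {a.name: a.worst_case_consequence for a in assets}

for name, consequence in physical_risks(inventory).items():
    print(f"{name}: {consequence}")
```

With this extra column in place, a risk assessment can start from the physical consequence instead of stopping at “unauthorized access to data”.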

Without knowing what our OT system controls, we would be led to think in terms of the CIA triad, not really considering that the real consequences could be a severe explosion that could kill nearby workers, destroy assets, release dangerous chemicals to the environment, and even cause damage to neighboring properties. Unfortunately, lack of inventory control, especially connecting industrial IT systems to the physical assets they control, is a very common problem (disclaimer – this is an article from DNV, where I am employed) across many industries.

An example of a physical system: a continuously stirred-tank reactor (CSTR) for producing a chemical in a batch-type process.

Step 1 – connect the dots: For every server, switch, transmitter, PLC and so on in your network, you need to know what jobs these items take part in performing. Only that way can you understand the potential consequences of a cyberattack against the OT system.

Step 2 – make friends with physical domain experts

If you work in OT security, you need to master a lot of complexity. You are perhaps an expert in industrial protocols, ladder logic programming, or building adversarial threat models? Understanding the security domain is itself a challenge, and expecting security experts to also be experts in all the physical domains they touch is unrealistic. You can’t expect OT security experts to know the details of all the technologies described as “equipment under control” in ISO standards. Should your SOC analyst be a trained chemical engineer as well as an OT security expert? Or should she know the details of how steel strength decreases as temperature rises when a pipe is engulfed in a jet fire? Of course not – nobody can be an expert at everything.

This is why risk assessments have to be collaborative; you need to make sure you get the input from relevant disciplines when considering risk scenarios. Going back to the chemical reactor discussed above, a social engineering incident scenario could be as follows.

John, who works as a plant engineer, receives a phishing e-mail that he falls for. Believing the attachment to be a legitimate instruction for re-calibration of the temperature sensor used in the reactor cooling control system, he executes the recipe from the attachment. It tells him to download a Python file from a Dropbox folder and execute it on the SCADA server. By doing so, he calibrates the temperature sensor to report a temperature 10 degrees lower than what it really measures. The script also installs a backdoor on the SCADA server, allowing hackers to take full control of it over the Internet.

The consequences of this could include overpressurizing the reactor, causing a deadly explosion. A chemical engineer would immediately react to the loss of cooling on the reactor, and help you understand the potential physical consequences. Make friends with domain experts.

Another important aspect of domain expertise is knowing the safety barriers. The example above was lacking several safety features that would be mandatory in most locations, such as a passive pressure-relief system that works without the involvement of any digital technologies. In many locations it is also mandatory to have a process shutdown system: a control system with separate sensors, PLCs and networks that intervenes and stops a potential accident using actuators put in place only for safety use, in order to avoid common cause failures between normal production systems and safety-critical systems. Lack of awareness of such systems can sometimes make OT security experts exaggerate the probability of the most severe consequences.

Step 2 – Make friends with domain experts. By involving the right domain expertise, you can get a realistic picture of the physical consequences of a scenario.

Step 3 – Respond in context

If you find yourself having to defend industrial systems against attacks, you need an incident response plan. This is no different from an enterprise IT environment – but here the plan also needs to take the operational context into account. A key difference is that for physical plants your response plan may actually involve taking physical action, such as manually opening and closing valves. Obviously, this needs to be planned – and exercised.

If welding will be a necessary part of handling your incident, coordinating with the industrial operations side had better be part of your incident response playbooks.

Even attacks that do not affect OT systems directly may lead to operational changes in the industrial environment. Hydro, for example, was hit by a ransomware attack in 2019 that crippled its enterprise IT systems. This forced the company to switch to manual operations at its aluminum production plants. There are lessons here for us all: we need to think about how to minimize impact not just after an attack, but also during the response phase, which may be quite extensive.

Scenario-based playbooks can be of great help in planning as well as execution of response. When creating the playbook we should

  • describe the scenario in sufficient detail to identify the affected systems
  • ask what it will take to return to operations if affected systems have to be taken out of service

The latter question would be very difficult for an OT security expert to answer alone. Again, you need your domain experts. In terms of the cyber incident response plan, this would lead to information on who to contact during response, who has the authority to make decisions about when to move to the next steps, and so on. For example, if you need to switch to manual operations in order to continue recovery of control system ICT equipment in a safe way, this has to be part of your playbook.
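
A scenario-based playbook can be captured as structured data that mixes cyber response steps with the physical actions operations must take. The sketch below is a hypothetical example built on the SCADA scenario above; every role, contact and step is an illustrative assumption:

```python
# Hypothetical scenario-based playbook combining security and
# operations steps. All names, roles and actions are illustrative.

playbook = {
    "scenario": "SCADA server compromise via phishing",
    "affected_systems": ["SCADA server", "Temperature Control PLC"],
    "contacts": {
        "incident_lead": "OT security on-call",
        "operations_authority": "shift supervisor",  # decides on manual ops
    },
    "steps": [
        {"actor": "operations", "action": "switch reactor cooling to manual control"},
        {"actor": "security",   "action": "isolate SCADA server from network"},
        {"actor": "forensics",  "action": "image server and preserve logs"},
        {"actor": "operations", "action": "verify safety systems on separate network"},
        {"actor": "IT",         "action": "rebuild SCADA server from known-good image"},
    ],
}

def actions_for(playbook, actor):
    """Pull out the steps a given team must rehearse in exercises."""
    return [s["action"] for s in playbook["steps"] if s["actor"] == actor]

print(actions_for(playbook, "operations"))
```

Structuring the playbook this way makes it easy to hand each team – including the process operators on the factory floor – its own rehearsal list for exercises.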

Step 3 – Plan and exercise incident response playbooks together with industrial operations. If valves need to be turned, or new pipe bypasses welded on as part of your response activities, this should be part of your playbook.

OT security is about saving lives, the environment and avoiding asset damage

In the discussion above there was not much mention of the CIA triad (confidentiality, integrity and availability), although seen from the OT system point of view, that is still the level we operate at. We still need to ensure that only authorized personnel have access to our systems, we need to protect data in transit and in storage, and we need to know that a packet storm isn’t going to take our industrial network down. The point we want to make is that we need to better articulate the consequences of security breaches in the OT system.

Step 1 – know what you have. It is often not enough to know what IT components you have in your system. You need to know what they are controlling too. This is important for understanding the risk related to a compromise of the asset, but also for planning how to respond to an attack.

Step 2 – make friends with domain experts. They can help you understand if a compromised asset could lead to a catastrophic scenario, and what it would take for an attacker to make that happen. Domain experts can also help you understand independent safety barriers that are part of the design, so you don’t exaggerate the probability of the worst-case scenarios.

Step 3 – plan your response with the industrial context in mind. Use the insight of the domain experts (that you are now friends with) to make practical playbooks – ones that may include physical actions to be taken on the factory floor by welders or process operators.