Connecting OT to Cloud: Key Questions for Practitioners

When we first started connecting OT systems to the cloud, it was typically to get access to data for analytics. That is still the primary use case, with most vendors offering some SaaS integration to help with analytics and planning. The cloud side of this is now more flexible than before, with more integrations, more capabilities, more AI, and even the beginnings of pushing commands back into the OT world from the cloud – something we will only see more of in the future. The downside, seen from the asset owner’s point of view, is that the critical OT system, with its legacy security model and old systems, is now connected to a hyperfluid black box making decisions for the physical world on the factory floor. There are a lot of benefits to be had, but also a lot of things that could go wrong.

How can OT practitioners learn to love the cloud? Let’s consider three key questions to ask when assessing the SaaS world from an OT perspective!

The first thing we have to do is accept that we’re not going to know everything. The second thing we have to do is ask ourselves, ‘What is it we need to know to make a decision?’… Let’s figure out what that is, and go get it.

Leo McGarry – character in “The West Wing”

The reason we connect our industrial control systems to the cloud is that we want to optimize. We want to stream data into flexible compute resources, to be used by skilled analysts to make better decisions. We are slowly moving towards allowing the cloud to make decisions that feed back into the OT system, making changes in the real world. From the C-suite, doing this is a no-brainer. How these decisions challenge the technology and the people working on the factory floors can be hard to see from the bird’s-eye view, where the discussion is about competitive advantage and efficiency gains instead of lube oil pressure or supporting a control panel still running on Windows XP.

The OT world is stable, robust and traditional, whereas the cloud world is responsive, adaptable and in constant flux. When the people managing stability meet the people managing flux, discussions can be difficult, like the disciples of Heraclitus debating the followers of Parmenides in ancient Greek philosophy.

Question 1: How can I keep track of changes in the cloud service?

Several OT practitioners have mentioned an unfamiliar challenge: the SaaS in the cloud changes without the knowledge of the OT engineers. They are used to strict management-of-change procedures, while the cloud is managed as a modern IT project with changes happening continuously. This is like putting Parmenides up against Heraclitus; we will need dialog to make this work.

Trying to convince the vendor to move away from modern software development practices with CI/CD pipelines and frequent changes to a more formal process with requirements, risk assessment spreadsheets and change acceptance boards is not likely to be a successful approach, although it may seem the most natural response to a new “black box” in the OT network for many engineers. At the same time, expecting OT practitioners to embrace a “move fast and break things, then fix them” mentality is also, fortunately, not going to work. A workable middle ground could look like this:

  • SaaS vendors should be transparent with OT customers about which services are used and how they are secured, as well as how they can affect the OT network. This overview should preferably be available to the asset owner dynamically, not as a static report.
  • Asset owners should remain in control of which features will be used.
  • A sufficient level of observability should be provided across the OT/cloud interface, to allow a joint situational understanding of the attack surface, cyber risk and incident management.

Question 2: Is the security posture of the cloud environment aligned with my OT security needs?

A key worry among asset owners is the security of the cloud solution, which is understandable given the number of data breaches we can read about in the news. Some newer OT/cloud integrations also challenge the traditional network-based security model with a push/pull DMZ for all data exchange. Newer systems sometimes include direct streaming to the cloud over the Internet, point-to-point VPNs and other alternative data flows. Say you have a crane operating in a factory, and this crane has been given a certain security level (SL2) with corresponding security requirements. The basis for this assessment has been that the crane is well protected by a DMZ and double firewalls. Now an upgrade of the crane introduces a new remote access feature and direct cloud integration via a 5G gateway delivered by the vendor. This has many benefits, but it challenges the traditional security model. The gateway itself is certified and well hardened, but the new system allows traffic from the cloud into the crane network, including remote management of the crane controllers. On the surface, the security of the SaaS seems fine, but the OT engineer finds it hard to trust the vendor here.

One way the vendor can help create the necessary trust is to allow the asset owner to see the overall security posture generated by automated tools, for example a CSPM solution. This information can be hard to interpret for the customer, so a selection of data and contextual explanations will be needed. An AI agent can assist with this, for example by mapping the infrastructure and security posture metrics to the services in use by the customer.

Question 3: How can we change the OT security model to adapt to new cloud capabilities?

The OT security model has long been built on network segmentation, with very static resources and security needs. When we connect these assets to a cloud environment that is undergoing more rapid change, it can challenge the local security needs in the OT network. Consider the following fictitious crane control system.

Crane with cloud integrations via 5G

In the crane example, the items in the blue box are likely to be quite static. The applications in the cloud are likely to see more rapid change, such as more integrations, AI assistants, and so on. A question that will have a large impact on the attack surface exposure of the on-prem crane system is the separation between components in the cloud. Imagine that the web application “Liftalytics” is running on a VM with an over-privileged service account. Then an attacker who exploits a vulnerability to get a shell on this web application VM may be able to move laterally to other cloud resources, even with network segregation in place. These types of security issues are generally invisible to the asset owner and OT practitioners.

If we start the cloud integration without any lateral movement path between a remote access system used by support engineers and the exposed web application, we may have an acceptable situation. But imagine that a need appears that makes the vendor connect the web app and the remote access console, creating a lateral movement path in the cloud. This must be made visible, and the OT owner should:

  1. Have to explicitly accept the change before it takes effect
  2. Be informed of the resulting change in security posture and attack surface, so that compensating measures can be taken in the on-prem environment

For example, if a new lateral movement path is created and this exposes the system to unacceptable risk, local changes can be made, such as disabling protocols at the server level or adding extra monitoring.

The tool we have at our disposal to build better security architectures is threat modeling. By combining insights into the attack surface from automated cloud posture management tools with cloud security automation capabilities, and with the required changes in protection, detection and isolation capabilities on-prem, we can build a living, holistic security architecture that allows for change when needed.

Key points

Connecting OT systems to the cloud creates complexity, and sometimes that complexity is hidden. We set up three questions to start the dialog between the OT engineers managing the typically static OT environment and the cloud engineers managing the more fluid cloud environments.

  1. How can I keep track of changes in the cloud environment? – The vendor must expose service inventory and security posture dynamically to the consumer.
  2. Is the security posture of the cloud environment aligned with my security level requirements? – The vendor must expose security posture dynamically, including providing the required context to see what the on-prem OT impact can be. AI can help.
  3. How can we change the OT security model to adapt to new cloud capabilities? – We can leverage data across on-prem and cloud, combined with threat modeling, to find holistic security architectures.


Practical SaaS Security Assessment Checklist for 2024

We all use web applications for a lot of the business computing we do. That means we need to care about the security of the applications we use, but this is not always easy to assess. The traditional approach of sending long security questionnaires won’t get you very far. That’s why I developed the practical checklist approach described below – and there’s a template too for subscribers to this blog!

In 2021 Daniel Miessler had a great blog post on the failings of security questionnaires and what to do instead, which I also commented on on this blog: Vendor Security Management: how to decide if tech is safe (enough) to use. The essence of that thinking is that questionnaires won’t help much; we should instead worry about whether there is a security program in place, and how the vendor handled their last breach. We can take that thought one step further and create a practical assessment process for SaaS apps we are considering using. The great thing about SaaS apps is that we get to test some of the security by using the tech, not only reading claims from others.

By using a checklist and assigning scores based on the security controls we think should be in place, we get a practical approach to assessing the security. This won’t give you a complete answer, but it will relatively quickly give you a way to sort the bad from the potentially good.

Google Sheet with the security assessment checklist

The way we built this checklist is by dividing our checks into 6 categories. We could have used more, and it is a good idea to tailor the controls you check to what’s important for you. In this example we have used the following categories:

  • Identity: most breaches happen at the user account level. This is important.
  • Integrations: leaking API keys and applications brought to their knees by DDoS are not fun. Do some checks.
  • Backups: You definitely want backups.
  • Data protection: how do you make sure other SaaS users can’t access your data? And what about the SaaS provider?
  • Logging: if things go wrong, you want to be able to see that. If you are blind, you have no security. Logs are critical.
  • Privacy: not only a legal issue, it is also important for everyone using the app. Colleagues and customers alike.

Let’s take a look at the identity checklist. I have organized the checklists, each with just a few checkpoints I find important, into separate sheets in a Google Sheet.

Identity checklist

For each line there is a checkpoint, some guidance on how to check it, and a dropdown where you can choose the rating “good”, “weak” or “bad”. You can also set it to “not applicable” if you think a particular control is not relevant for the current use case. There is also a cell to jot down some notes about your assessment. Below the table I have added some extra assessment advice to make it easier to evaluate what’s more important in the checklist.

For each category, an overall score is calculated as a percentage. I don’t think you should use this as a hard threshold, but low scores are worse than high scores. I used the following formula to calculate the overall score:

SCORE = (number of good items + 0.5 × number of weak items − number of bad items) / (number of applicable items)

This is not a scientific formula, but it seems to give a reasonable spread of scores. The score is punished by bad results, you get some credit for weak results, and the best possible score is still 100%.
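If you want to reproduce the scoring outside the spreadsheet, here is a minimal Python sketch of the same calculation (the category and the counts in the example are made up):

def category_score(good, weak, bad):
    """Score a checklist category as a percentage of applicable items."""
    applicable = good + weak + bad
    if applicable == 0:
        return None  # nothing applicable in this category
    return 100 * (good + 0.5 * weak - bad) / applicable

# Example: 4 good, 1 weak and 1 bad item in the Identity category
print(f"Identity: {category_score(4, 1, 1):.0f}%")  # prints "Identity: 58%"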

The Google sheet is free to anyone subscribing to this blog – enjoy 🙂

For subscribers: here’s the checklist: Free SaaS security evaluation template.

Catching bad guys in your system logs

When attackers target our systems, they leave traces. The first place to look is the logs. Hopefully the most important logs are being collected and sent to a SIEM (security information and event management) system, but in any case, we need to know how to search logs to find traces of malicious activity. Let’s consider three very common attack scenarios:

• Brute-force attack on exposed remote access port (SSH or RDP)
• Establishing persistence through a cron job or a scheduled task
• Adding accounts or credentials to maintain persistence

Attackers leave footprints from their actions. The primary tool for figuring out what happened on a system, is log analysis.

Brute force

Brute-force attack: an attacker may try to gain access by guessing a password. This will be visible in the logs as a number of failed logon attempts, often from the same IP address. If your system is exposed to the Internet, this is constantly ongoing. The attackers are not human operators but botnets scanning the entire Internet, hoping to gain access. An effective way of avoiding this is to reduce the attack surface and not expose RDP or SSH directly on the Internet.

For Windows, failed logon attempts will generate event log entries with Event ID 4625. What you should be looking for is a number of failed attempts (ID 4625) followed by a successful attempt from the same IP address. Successful logons have Event ID 4624. You will need administrator privileges to read the Windows logs. You can use the Event Viewer application on Windows to do this, but if you want to create a more automated detection, you can use a PowerShell script to check the logs. You still need administrator access, though.

The PowerShell cmdlet Get-WinEvent can be used to read event logs. You can see how to use the cmdlet here: https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.diagnostics/get-winevent?view=powershell-7.2

You can also use Get-EventLog if you are on PowerShell 5, but that cmdlet is no longer present in PowerShell 7.
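If you prefer to script the correlation itself, here is a minimal Python sketch of the logic. It assumes the relevant Security log events have already been exported to a list of records (for example with Get-WinEvent piped to ConvertTo-Json); the field names "id", "ip" and "time" are placeholders for whatever your export uses, and the threshold is just an example:

from collections import defaultdict

FAILED, SUCCESS = 4625, 4624
THRESHOLD = 10  # failed attempts from one IP before we start caring

def find_suspicious_logons(events):
    """Flag successful logons preceded by many failures from the same IP."""
    failures = defaultdict(int)
    hits = []
    for event in sorted(events, key=lambda e: e["time"]):
        if event["id"] == FAILED:
            failures[event["ip"]] += 1
        elif event["id"] == SUCCESS and failures[event["ip"]] >= THRESHOLD:
            hits.append(event)
    return hits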

For attacks on SSH on Linux, you will find entries in the auth log (written via the authpriv syslog facility, typically /var/log/auth.log or /var/log/secure). But the easiest way to spot malicious logon attempts is to use the command “lastb”, which will show you the last failed logon attempts. This command requires sudo privileges. If you correlate a series of failed attempts reported by “lastb” with a successful attempt found in the auth log from the same IP address, you probably have a breach.

lastb: The last 10 failed login attempts on a cloud hosted VM exposing SSH on port 22 to the Internet.
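The same correlation can be scripted on Linux by reading the auth log directly. A small Python sketch (it needs root to read the log; the path and the threshold are assumptions, adjust for your distribution):

import re
from collections import defaultdict

AUTH_LOG = "/var/log/auth.log"  # /var/log/secure on RHEL-based systems
THRESHOLD = 10

failed = defaultdict(int)
with open(AUTH_LOG) as log:
    for line in log:
        fail = re.search(r"Failed password for .* from (\S+)", line)
        if fail:
            failed[fail.group(1)] += 1
            continue
        ok = re.search(r"Accepted (?:password|publickey) for .* from (\S+)", line)
        if ok and failed[ok.group(1)] >= THRESHOLD:
            print("Possible breach:", line.strip())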

Persistence

Let’s move on to persistence through scheduled tasks or cron jobs.

The Event ID you are looking for on Windows is 4698, which means a scheduled task was created. There are many reasons to create scheduled tasks; they can be related to software updates, cleanup operations, synchronization tasks and many other things. Creating a scheduled task is also a popular way for an attacker to establish persistence. If an attacker has managed to drop a script or a binary on a target machine and sets a scheduled task to execute it at a fixed interval, for example every 5 minutes, the malware has an easy way to reach out to a command and control server on the Internet.

There are two types of scheduled tasks to worry about here. One runs under the user account and will only run when the user is logged on to the computer. If the attacker can establish a scheduled task that runs with privileges, the task will run without a user being logged on – but the computer must of course be in a running state. Because of this, it is a good idea to check which user account created the scheduled task.

For further details on threat hunting using scheduled task events, see the official documentation from Microsoft: https://docs.microsoft.com/en-us/windows/security/threat-protection/auditing/event-4698. There is also a good article from socinvestigation worth taking a look at: https://www.socinvestigation.com/threat-hunting-using-windows-scheduled-task/.
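As a starting point for a hunt, you can filter exported 4698 events against a list of accounts you expect to create scheduled tasks. The Python sketch below assumes the events have been exported as records; the field names, the allow-list and the exported_events variable are illustrative only (the real event carries the creating account and the full task definition XML):

EXPECTED_CREATORS = {"SYSTEM", "svc_patching"}  # example allow-list, adjust to your environment

def suspicious_tasks(events):
    """Yield scheduled tasks created by accounts outside the allow-list."""
    for event in events:
        if event["id"] == 4698 and event["creator"] not in EXPECTED_CREATORS:
            yield event["creator"], event["task_name"]

# for creator, task in suspicious_tasks(exported_events):
#     print(f"Review task {task!r} created by {creator}")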

Cron jobs are logged to different files depending on the system you are on. Most systems will log cron job execution to /var/log/syslog, whereas some, such as CoreOS and Amazon Linux, will log to /var/log/cron. On a systemd-based Linux distro, you can also use “journalctl -u cron” to view the cron job logs. Look for jobs executing commands or binaries you don’t recognize, and then verify what those are.

You do not get exit codes in the default cron logs, only a record of the command the cron job starts. The output of a job is by default mailed to the job’s owner, but this can be configured to go to a file instead. Usually the standard cron logs are sufficient to discover abuse of this feature to gain persistence or run C2 communications.
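To make unknown binaries stand out, you can summarize which commands cron has actually executed. A small Python sketch (the log path is an assumption, see above):

import re
from collections import Counter

CRON_LOG = "/var/log/syslog"  # /var/log/cron on e.g. Amazon Linux
pattern = re.compile(r"CRON\[\d+\]: \((\S+)\) CMD \((.+)\)")

jobs = Counter()
with open(CRON_LOG) as log:
    for line in log:
        match = pattern.search(line)
        if match:
            jobs[(match.group(1), match.group(2))] += 1

# Print the least surprising jobs first; rare (user, command) pairs at the bottom deserve a look
for (user, command), count in jobs.most_common():
    print(f"{count:5d}  {user:12s}  {command}")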

Adding accounts

Finally, we should check if an attacker has added an account, a common way to establish extra persistence channels.

For Windows, the relevant Event ID is 4720. This is generated every time a user account is created, whether centrally on a domain controller or locally on a workstation. If you do not expect user accounts to be created on the system, every occurrence of this event should be investigated. The Microsoft documentation has a long list of signals to monitor for regarding this event: https://docs.microsoft.com/en-us/windows/security/threat-protection/auditing/event-4720.
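A simple hunt is to flag every exported 4720 event together with the account that created it. A short Python sketch in the same style as above (field names and exported_events are again placeholders for your export format):

def new_account_events(events):
    """Yield (creator, new account, time) for every account creation event."""
    for event in events:
        if event["id"] == 4720:
            yield event["creator"], event["new_account"], event["time"]

# for creator, account, when in new_account_events(exported_events):
#     print(f"{when}: {creator} created account {account}")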

On Linux, the command “adduser” can be used to add a new user. Creating a new user will create an entry in the /var/log/auth.log file. Here’s an example from adding a user called “exampleuser” on Ubuntu (running on a host called “attacker”).

Jan 29 20:14:27 attacker sudo: cyberhakon : TTY=pts/0 ; PWD=/home/cyberhakon ; USER=root ; COMMAND=/usr/sbin/useradd exampleuser
Jan 29 20:14:27 attacker useradd[6211]: new group: name=exampleuser, GID=1002
Jan 29 20:14:27 attacker useradd[6211]: new user: name=exampleuser, UID=1001, GID=1002, home=/home/exampleuser, shell=/bin/sh

Changing the password for the newly created user is also visible in the log.

Jan 29 20:18:20 attacker sudo: cyberhakon : TTY=pts/0 ; PWD=/var/log ; USER=root ; COMMAND=/usr/bin/passwd exampleuser
Jan 29 20:18:27 attacker passwd[6227]: pam_unix(passwd:chauthtok): password changed for exampleuser
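A quick way to hunt for this across the whole log is to scan for the “new user:” entries shown above. A minimal Python sketch (path as before; the script needs root to read the log):

AUTH_LOG = "/var/log/auth.log"

with open(AUTH_LOG) as log:
    for line in log:
        if "useradd" in line and "new user:" in line:
            print(line.strip())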

Summary: we can detect a lot of common attacker behavior just by looking at the default system logs. Learning how to look for such signals is very useful for incident response and investigations. Even better is to be prepared: forward logs to a SIEM and create alerts based on behavior that is expected from attackers but not from regular system use. Then you can stop the attackers before much damage is done.

CCSK Domain 1: Cloud Computing Concepts and Architecture

Recently I participated in a one-day class on the contents required for the “Certificate of Cloud Security Knowledge”, held by Peter HJ van Eijk in Trondheim as part of the conference Sikkerhet og Sårbarhet 2019 (Norwegian for “Security and Vulnerability 2019”). The one-day workshop was interesting and the instructor was good at creating interactive discussions – making it much better than the typical PowerPoint overdose of commercial professional training sessions. There is a certification exam that I have not yet taken, and I decided I should document my notes on my blog; perhaps others can find some use for them too.

The CCSK exam closely follows a document made by the Cloud Security Alliance (CSA) called “CSA Security Guidance for Critical Areas of Focus in Cloud Computing v4.0” – a document you can download for free from the CSA webpage. They also lean on ENISA’s “Cloud Computing Risk Assessment”, which is also a free download.

Cloud computing isn’t about who owns the compute resources (someone else’s computer) – it is about providing scale and cost benefits through rapid elasticity, self-service, shared resource pools and a shared security responsibility model.

The way I’ll do these blog posts is that I’ll first share my notes, and then give a quick comment on what the whole thing means from my point of view (which may not really be that relevant to the CCSK exam if you came here for a shortcut to that).

Introduction to D1 (Cloud Concepts and Architecture)

Domain 1 contains 4 sections:  

  • Defining cloud computing 
  • The cloud logical model 
  • Cloud conceptual, architectural and reference model 
  • Cloud security and compliance scope, responsibilities and models 

NIST definition of cloud computing: a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction.

A Cloud User is the person or organization requesting computational resources. The Cloud Provider is the person or organization offering the resources. 

Key techniques to create a cloud:  

  • Abstraction: we abstract resources from the underlying infrastructure to create resource pools  
  • Orchestration: coordination of delivering resources out of the pool on demand.  

Clouds are multitenant by nature. Consumers are segregated and isolated but share resource pools.  

Cloud computing models 

The CSA’s foundational model of cloud computing is the NIST model. A more in-depth reference model is taken from ISO/IEC. The guidance talks mostly about the NIST model and doesn’t dive into the ISO/IEC model, which is probably sufficient for most definition needs.

Cloud computing has 5 characteristics:

  1. Shared resource pool (compute resources in a pool that consumers can pull from)
  2. Rapid elasticity (can scale up and down quickly)
  3. Broad network access
  4. On-demand self-service (management plane, APIs)
  5. Measured service (pay-as-you-go)

Cloud computing has 3 service models:

  • Software as a Service (SaaS): like Cybehave or Salesforce
  • Platform as a Service (PaaS): like WordPress or AWS Elastic Beanstalk
  • Infrastructure as a Service (IaaS): like VM’s running in Google Cloud

Cloud computing has 4 deployment models:

  • Public Cloud: pool shared by anyone
  • Private Cloud: pool shared within an organization
  • Hybrid Cloud: connection between two clouds, commonly used when an on-prem datacenter connects to a public cloud
  • Community Cloud: pool shared by a community, for example insurance companies that have formed some form of consortium

Models for discussing cloud security

The CSA document discusses multiple model types in a somewhat incoherent manner. The types of models it mentions can be categorized as follows:

  • Conceptual models: descriptions to explain concepts, such as the logic model from CSA.  
  • Controls models: like CCM 
  • Reference architectures: templates for implementing security controls 
  • Design patterns: solutions to particular problems 

The document also outlines a simple cloud security process model:

  • Identify security and compliance requirements, and existing controls 
  • Select provider, service and deployment models 
  • Define the architecture 
  • Assess the security controls 
  • Identify control gaps 
  • Design and implement controls to fill gaps 
  • Manage changes over time 

The CSA logic model

This model explains 4 “layers” of a cloud environment and introduces some “funny words”:

  • Infrastructure: the core components in computing infrastructure, such as servers, storage and networks 
  • Metastructure: protocols and mechanisms providing connections between infrastructure and the other layers 
  • Infostructure: The data and information (database records, file storage, etc) 
  • Applistructure: The applications deployed in the cloud and the underlying applications used to build them.

The key difference between traditional IT and cloud is the metastructure. Cloud metastructure contains the management plane components.  

Another key feature of cloud is that each layer tends to double. For example, infrastructure is managed by the cloud provider, but the cloud consumer will establish a virtual infrastructure that will also need to be managed (at least in the case of IaaS).

Cloud security scope and responsibilities 

The responsibility for security domains maps to the access the different stakeholders have to each layer in the architecture stack.  

  • SaaS: the cloud provider is responsible for perimeter, logging and application security, and the consumer may only have access to provision users and manage entitlements
  • PaaS: the provider is typically responsible for platform security and the consumer is responsible for the security of the solutions deployed on the platform. Configuring the offered security features is often left to the consumer.  
  • IaaS: cloud provider is responsible for hypervisors, host OS, hardware and facilities, consumer for guest OS and up in the stack.  

The shared responsibility model leaves us with two focus areas:

  • Cloud providers should clearly document internal security management and security controls available to consumers.  
  • Consumers should create a responsibility matrix to make sure controls are followed up by one of the parties 

Two compliance tools exist from the CSA and are recommended for mapping security controls:  

  • The Consensus Assessment Initiative Questionnaire (CAIQ) 
  • The Cloud Controls Matrix (CCM) 

#2cents

This domain is introductory and provides some terminology for discussing cloud computing. The key aspects from a risk management point of view are:

  • Cloud creates new risks that need to be managed, especially as it involves more companies in maintaining the security of the full stack compared to a fully in-house managed stack. Requirements, contracts and audits become important tools.
  • The NIST model is more or less universally used in cloud discussions in practice. The service models are known to most IT practitioners, at least on the operations side.
  • The CSA guidance correctly designates the “metastructure” as the new kid on the block. The practical incarnation of this is APIs and console access (e.g. gcloud at the API level and Google Cloud Console at the “management plane” level). From a security point of view this means that maintaining the security of local control libraries becomes very important, as well as identity and access management for the control plane in general.

In addition to the “who does what” problem that can occur with a shared security model, the self-service and fast-scaling properties of cloud computing often lead to “new and shiny” being pushed faster than security is aware of. An often overlooked part of “pushing security left” is that we also need to push both knowledge and accountability together with the ability to access the management plane (or parts of it through APIs or the cloud management console).