DevSecOps: Embedded security in agile development

The way we write, deploy and maintain software has changed greatly over the years, from waterfall to agile, from monoliths to microservices, from the basement server room to the cloud. Yet, many organizations haven’t changed their security engineering practices – leading to vulnerabilities, data breaches and lots of unpleasantness. This blog post is a summary of my thoughts on how security should be integrated from user story through coding and testing and up and away into the cyber clouds. I’ve developed my thinking around this as my work in the area has moved from industrial control systems and safety critical software to cloud native applications in the “internet economy”.

What is the source of a vulnerability?

At the outset of this discussion, let’s clarify two common terms, as they are used by me. In very unacademic terms:

  • Vulnerability: a flaw in the way a system is designed or operated that allows an adversary to perform actions the system owner did not intend to make available.
  • Threat: actions performed on an asset in the system by an adversary in order to achieve an outcome that he or she is not supposed to be able to achieve.

The primary objective of security engineering is to stop adversaries from being able to achieve their evil deeds. Most often, evilness is possible because of system flaws. How these flaws end up in the system is important to understand when we want to make life harder for the adversary. Vulnerabilities are flaws, but not all flaws are vulnerabilities. Fortunately, quality management helps reduce defects whether they can be exploited by evil hackers or not. Let’s look at three types of vulnerabilities we should work to abolish:

  • Bugs: coding errors, implementation flaws. The design and architecture is sound, but the implementation is not. A typical example of this is a SQL injection vulnerability in a web app.
  • Design flaws: errors in architecture and how the system is planned to work. A flawed plan that is implemented perfectly can be very vulnerable. A typical example of this is a broken authorization scheme.
  • Operational flaws: the system makes it hard for users to do things correctly, making it easier to trick privileged users into performing actions they should not. An example would be a confusing permission system, where an adversary uses social engineering of customer support to gain privilege escalation.

Security touchpoints in a DevOps lifecycle

Traditionally there has been a lot of discussion on a secure development lifecycle. But our concern is removing vulnerabilities from the system as a whole, so we should follow the system from infancy through operations. The following touchpoints do not make up a blueprint; they are an overview of security aspects in different system phases.

  • Dev and test environment:
    • Dev environment helpers
    • Pipeline security automation
    • CI/CD security configuration
    • Metrics and build acceptance
    • Rigor vs agility
  • User roles and stories
    • Rights management
  • Architecture: data flow diagram
    • Threat modeling
    • Mitigation planning
    • Validation requirements
  • Sprint planning
    • User story reviews
    • Threat model refinement
    • Security validation testing
  • Coding
    • Secure coding practices
    • Logging for detection
    • Abuse case injection
  • Pipeline security testing
    • Dependency checks
    • Static analysis
    • Mitigation testing
      • Unit and integration testing
      • Detectability
    • Dynamic analysis
    • Build configuration auditing
  • Security debt management
    • Vulnerability prioritization
    • Workload planning
    • Compatibility blockers
  • Runtime monitoring
    • Feedback from ops
    • Production vulnerability identification
    • Hot fixes are normal
    • Incident response feedback

Dev environment aspects

If an adversary takes control of the development environment, he or she can likely inject malicious code in a project. Securing that environment becomes important. The first principle should be: do not use production data, configurations or servers in development. Make sure those are properly separated.

The developer workstation should also be properly hardened, as should any cloud accounts used during development, such as GitHub or a cloud-based build pipeline. That means two-factor authentication, regular patching, no day-to-day work on admin accounts, and encrypted network traffic.

The CI/CD pipeline should be configured securely: no hard-coded secrets, limited access to the secrets that are stored there, and control over who can change the build configuration.

During early phases of a project it is tempting to be relaxed with testing, dependency vulnerabilities and so on. This can quickly turn into technical debt – first in one service, then in many, and at the end there is no way to refinance your security debt at lower interest rates. Technical debt compounds like credit card debt – so manage it carefully from the beginning. To help with this, create acceptable build thresholds, and a policy on lifetime of accepted poor metrics. Take metrics from testing tools and let them guide: complexity, code coverage, number of vulnerabilities with CVSS above X, etc. Don’t select too many KPIs, but don’t allow the ones you track to slip.
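A build threshold with a time-limited exception could be sketched roughly like this. The metric names, the limits and the `gate` function are all assumptions made for illustration; the real numbers have to come from your own context.

```python
from datetime import date

# Hypothetical thresholds; the right numbers depend on the project.
THRESHOLDS = {
    "coverage_percent": ("min", 80),
    "max_cyclomatic_complexity": ("max", 15),
    "vulns_cvss_7_plus": ("max", 0),
}


def gate(metrics, waivers, today):
    """Return the names of the metrics that should break the build.

    A waiver maps a metric name to the date the accepted exception
    expires. An expired waiver no longer protects the metric.
    """
    failures = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = metrics[name]
        ok = value >= limit if kind == "min" else value <= limit
        waived = name in waivers and today <= waivers[name]
        if not ok and not waived:
            failures.append(name)
    return failures


metrics = {
    "coverage_percent": 72,
    "max_cyclomatic_complexity": 12,
    "vulns_cvss_7_plus": 1,
}
# Poor coverage has been accepted, but only until the end of January 2030.
waivers = {"coverage_percent": date(2030, 1, 31)}

print(gate(metrics, waivers, today=date(2030, 1, 15)))  # coverage still waived
print(gate(metrics, waivers, today=date(2030, 2, 15)))  # waiver has expired
```

The expiry date is what keeps an accepted poor metric from quietly becoming permanent.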

One could argue that strict policies and acceptance criteria will hurt agility and slow a project down. Truth is that lack of rigor will come back to bite us, but at the same time too much will indeed slow us down or even turn our agility into a stale bureaucracy. Finding the right balance is important, and this should be informed by context. A system processing large amounts of sensitive personal information requires more formalism and governance than a system where a breach would have less severe consequences. One size does not fit all.

User roles and stories

Most systems have different types of users with different needs – and different access rights. Hackers love developers who don’t plan in terms of user roles and stories – the things each user would need to do with the system – because lack of planning often leads to much more liberal permissions “just in case”. User roles and stories should thus be a primary security tool. Consider a simple app for approval of travel expenses in a company. This app has two primary user types:

  • Travelling salesmen who need reimbursements
  • Bosses who will approve or reject reimbursement claims

In addition to this, someone must be able to add and remove users, grant access to the right travelling salesmen for a given boss, etc. In other words, the system also needs an Administrator.

Let’s take the travelling salesman and look at “user stories” that this role would generate:

  • I need to enter my expenses into a report
  • I need to attach documentation such as receipts to this report
  • I need to be able to send the report to the boss for approval
  • I want to see the approval status of my expense report
  • I need to receive a notification if my report is not approved
  • I need to be able to correct any mistakes based on the rejection

Based on this, it is clear that the permissions of the “travelling salesman” role only need to give write access to some operations, for data relating to this specific user, and read rights on the status of the approval. This goes directly into our authorization concept for the app, and already here generates testable security annotations:

  • A travelling salesman should not be able to read the expense report of another travelling salesman
  • A travelling salesman should not be able to approve expense reports, including his own

These negative unit tests could already go into the design as “security annotations” for the user stories.
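The two annotations can be written down as negative tests before any real code exists. Here is a minimal sketch, assuming a made-up in-memory model; `User`, `ExpenseReport`, `can_read` and `can_approve` are illustrative names, not part of any real framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    role: str  # "salesman", "boss" or "admin" (hypothetical role names)


@dataclass(frozen=True)
class ExpenseReport:
    owner: str     # name of the salesman who wrote the report
    approver: str  # name of the boss assigned to that salesman


def can_read(user: User, report: ExpenseReport) -> bool:
    """A salesman reads only own reports, a boss only assigned ones."""
    if user.role == "salesman":
        return report.owner == user.name
    if user.role == "boss":
        return report.approver == user.name
    return False  # everyone else is denied by default


def can_approve(user: User, report: ExpenseReport) -> bool:
    """Only the assigned boss may approve, and never his or her own report."""
    return (
        user.role == "boss"
        and report.approver == user.name
        and report.owner != user.name
    )


alice = User("alice", "salesman")
bob = User("bob", "salesman")
carol = User("carol", "boss")
alices_report = ExpenseReport(owner="alice", approver="carol")

# Negative tests taken directly from the security annotations.
assert not can_read(bob, alices_report)
assert not can_approve(alice, alices_report)
assert not can_approve(bob, alices_report)

# Positive tests, so the negative ones cannot pass by denying everything.
assert can_read(alice, alices_report)
assert can_approve(carol, alices_report)
```

The positive tests matter as much as the negative ones: a function that always returns False would pass every security annotation and still be useless.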

In addition to user stories, we have abusers and abuse stories. This is about the type of adversaries, and what they would like to do, that we don’t want them to be able to achieve. Let’s take as an example a hacker hired by a competitor to perform industrial espionage. We have the adversary role “industrial espionage”. Here are some abuse cases we can define that relate to motivation of a player rather than technical vulnerabilities:

  • I want to access all travel reports to map where the sales personnel of the firm are going to see clients
  • I want to see the financial data approved to gauge the size of their travel budget, which would give me information on the size of their operation
  • I’d like to find names of people from their clients they have taken out to dinner, so we know who they are talking to at potential client companies
  • I’d like to get user names and personal data that allow me to gauge if some of the employees could be recruited as insiders or poached to come work for us instead

How is this hypothetical information useful for someone designing an app to use for expense reporting? By knowing the motivations of the adversaries we can better gauge how likely it is that someone will attempt to exploit a certain type of vulnerability. Remember: vulnerabilities are not the same as threats – and we have limited resources, so the vulnerabilities that would help attackers achieve their goals are more important to remove than those that cannot easily help the adversary.

Architecture and data flow diagrams

Coming back to the sources of vulnerabilities, we want to avoid vulnerabilities of three kinds: software bugs, software design flaws, and flaws in operating procedures. Bugs are implementation errors, and the way we try to avoid them is by managing competence, workload and stress level, and by use of automated security testing such as static analysis and similar tools. Experience from software reliability engineering shows that about 50% of software flaws are implementation errors – the rest would then be design flaws. These are designs and architectures that do not implement the intentions of the designer. Static analysis cannot help us here, because there may be no coding errors such as lack of exception handling or lack of input validation – it is just the concept that is wrong; for example giving a user role too many privileges, or allowing a component to talk to a component it shouldn’t have access to.

A good tool for identification of such design flaws is threat modeling based on a data flow diagram. Make a diagram of the software data flow, break it down into components on a reasonable level, and consider how an adversary could attack each component and what the impact could be. By going through an exercise like this, you will likely identify potential vulnerabilities and weaknesses that you need to handle. The mitigations you introduce may be various security controls – such as blocking internet access for a server that only needs to be available on the internal network.

The next question then is – how do you validate that your controls work? Do you order a penetration test from a consulting company? That could work, but it doesn’t scale very well; you want this to work in your pipeline. The primary tools to turn to are unit and integration testing.

We will not discuss the techniques for threat modeling in this post, but there are different techniques that can be applied. Keep it practical, don’t dive too deep into the details – it is better to start with a higher level view on things, and rather refine it as the design is matured.

Often a STRIDE-like approach is a good start, and for the worst case scenarios it can be worthwhile diving into more detail with attack trees. An attack tree is a fault tree applied to adversarial modeling.

After the key threats have been identified, it is time to plan how to deal with that risk. We should apply the defense-in-depth principle, and remember that a single security control is usually not enough to stop all attacks – because we do not know what all possible attack patterns are. When we have come up with mitigations for the threats we worry about, we need to validate that they actually work. This validation should happen at the lowest possible level – unit tests, integration tests. It is a good idea for the developer to run his or her own tests, but these validations definitely must live in the build pipeline.

Let’s consider an authentication flow using SMS-based two-factor authentication. This is the authentication for an application used by politicians, and there are skilled threat actors who would like to gain access to individual accounts.

A simple data flow diagram for a 2FA flow

Here’s how the authentication process works:

  • User connects to the domain and gets a single-page application loaded in the browser with a login form with username and password
  • The user enters credentials, which are sent as a POST request to the API server, which validates them against stored credentials (hashed in a safe way) in a database. The API server only accepts requests from the right domain, and the DB server is not internet accessible.
  • When the correct credentials have been entered, the SPA updates with a 2FA challenge, and the API server sends a POST request to a third-party SMS gateway, which sends the token to the user’s cell phone.
  • The user enters the code, and if valid, is authenticated. A JWT is returned to the browser and stored in localStorage.

Let’s put on the dark hat and consider how we can take over this process.

  1. SIM card swapping combined with a phishing email to capture the credentials
  2. SIM card swapping combined with keylogger malware for password capture
  3. Phishing capturing both password and the second factor from a spoofed login page, and reusing credentials immediately
  4. Create an evil browser extension and trick the user into installing it using social engineering. Use the browser extension to steal the token.
  5. Compromise a dependency used by the application’s frontend, to allow man-in-the-browser attacks that can steal the JWT after login.
  6. Compromise a dependency used in the API to give direct access to the API server and the database
  7. Compromise the 3rd party SMS gateway to capture credentials, use password captured with phishing or some other technique
  8. Exploit a vulnerability in the API to bypass authentication, either in a dependency or in the code itself.

As we see, the threat is the adversary getting access to a user account. There are many attack patterns that could be used, and only one of them involves only the code written in the application. If we are going to start planning mitigations here, we could first get rid of the first two problems by not using SMS for two-factor authentication but rather relying on an authenticator app, like Google Authenticator. Test: no requests to the SMS gateway.
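Authenticator apps typically implement time-based one-time passwords (TOTP, RFC 6238), so the server side of that mitigation can be sketched with the standard library alone. The function names `totp` and `verify` and the drift window are my own choices for illustration; in a real project a maintained library is the safer option.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password following RFC 6238 with HMAC-SHA1."""
    counter = unix_time // step
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation as defined in RFC 4226
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify(secret: bytes, submitted: str, unix_time: int,
           step: int = 30, window: int = 1) -> bool:
    """Accept the code for the current time step and `window` steps either side."""
    return any(
        hmac.compare_digest(totp(secret, unix_time + drift * step, step), submitted)
        for drift in range(-window, window + 1)
    )


# Test secret from the RFC test vectors; never use a fixed secret in production.
secret = b"12345678901234567890"
code = totp(secret, 59)
print(code)
print(verify(secret, code, 59))
```

Nothing leaves the server in this flow, which is what makes the “no requests to the SMS gateway” test possible.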

Phishing: avoid direct POST requests from a phishing domain to the API server by only allowing CORS requests from our own domain. Send a verification email when a login is detected from an unknown machine. Tests: check that CORS requests from other domains fail, and check that an email is sent when a new login occurs.

Browser extensions: capture metadata/fingerprint data and detect token reuse across multiple machines. Test: same token in different browsers/machines should lead to detection and logout.
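The token reuse test could look something like the sketch below. It assumes the server keeps a record of which client each token was issued to; `fingerprint`, `SessionRegistry` and the metadata that goes into the fingerprint are hypothetical, and real fingerprints are noisy enough that false alarms have to be expected.

```python
import hashlib


def fingerprint(user_agent: str, ip_prefix: str) -> str:
    """Reduce client metadata to a short, comparable value (assumed inputs)."""
    return hashlib.sha256(f"{user_agent}|{ip_prefix}".encode()).hexdigest()


class SessionRegistry:
    """Remembers which client each token was issued to."""

    def __init__(self):
        self._issued = {}      # token id -> fingerprint seen at login
        self._revoked = set()

    def issue(self, token_id: str, fp: str) -> None:
        self._issued[token_id] = fp

    def check(self, token_id: str, fp: str) -> bool:
        """Return True if the request may proceed.

        A known token presented by a different client is treated as
        stolen: it is revoked, so the legitimate user is logged out too.
        """
        if token_id in self._revoked or token_id not in self._issued:
            return False
        if self._issued[token_id] != fp:
            self._revoked.add(token_id)
            return False
        return True


registry = SessionRegistry()
laptop = fingerprint("Firefox/115 Linux", "203.0.113")
attacker = fingerprint("Chrome/120 Windows", "198.51.100")
registry.issue("token-1", laptop)

assert registry.check("token-1", laptop)        # normal use
assert not registry.check("token-1", attacker)  # reuse detected
assert not registry.check("token-1", laptop)    # token is now revoked
```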

Compromised dependencies are a particularly difficult attack vector to deal with, as the vulnerability is typically unknown. This is in practice a zero-day. For token theft, the mitigation of using metadata is valid. In addition it is good practice to have a process for acceptance of third-party libraries beyond checking for “known vulnerabilities”. Compromise of the third-party SMS gateway is also difficult to deal with in the software project and should be part of a supply chain risk management program – but this problem is solved by removing the third party.

Exploit a vulnerability in the app’s API: perform static analysis and dependency analysis to minimize known vulnerabilities. Test: no high-risk vulnerabilities detected with static analysis or dependency checks.

We see that in spite of having many risk reduction controls in place, we do not cover everything that we know, and there are guaranteed to be attack vectors in use that we do not know about.

Sprint planning – keeping the threat model alive

Sometimes “secure development” methodologies receive criticism for “being slow”. Too much analysis, the sprint stops, productivity drops. This is obviously not good, so the question is rather “how can we make security a natural part of the sprint”? One answer to that, at least a partial one, is to have a threat model based on the overall architecture. When it is time for sprint planning, there are three essential pieces that should be revisited:

  • The user stories or story points we are addressing; do they introduce threats or points of attack not already accounted for?
  • Is the threat model we created still representative for what we are planning to implement? Take a look at the data flow diagram and see if anything has changed – if it has, evaluate if the threat model needs to be updated too.
  • Finally: for the threats relevant to the issues in the sprint backlog, do we have validation for the planned security controls?

Simply discussing these three issues would often be enough to see if there are more “known unknowns” that we need to take care of, and will allow us to update the backlog and test plan with the appropriate annotations and issues.

Coding: the mother of bugs after the design flaws have been agreed upon

The threat modeling as discussed above has as its main purpose to uncover “design flaws”. While writing code, it is perfectly possible to implement a flawed plan in a flawless manner. That is why we should really invest a lot of effort in creating a plan that makes sense. The other half of vulnerabilities are bugs – coding errors. As long as people are still writing code, and not some very smart AI, errors in code will be related to human factors – or human error, as it is popularly called. This often points the finger of blame at a single individual (the developer), but since none of us are working in a vacuum, there are many factors that influence these bugs. Let us try to classify these errors (leaning heavily on human factors research) – broadly there are 3 classes of human error:

  • Slips: errors made due to lack of attention, a mishap. Think of this like a typo; you know how to spell a word but you make a small mistake, perhaps because your mind is elsewhere or because the keyboard you are typing on is unfamiliar.
  • Competence gaps: you don’t really know how to do the thing you are trying to do, and this lack of knowledge and practice leads you to make the wrong choice. Think of an inexperienced vehicle driver on a slippery road in the dark of the night.
  • Malicious error injection: an insider writes bad code on purpose to hurt the company – for example because he or she is being blackmailed.

Let’s leave the evil programmer aside and focus on how to minimize bugs that are created due to other factors. Starting with “slips” – which factors would influence us to make such errors? Here are some:

  • Not enough practice to make the action to take “natural”
  • High levels of stress
  • Lack of sleep
  • Task overload: too many things going on at once
  • Outside disturbances (noise, people talking to you about other things)

It is not obvious that the typical open office plan favored by IT firms is the optimal layout for programmers. Workload management, work-life balance and physical working environment are important factors for avoiding such “random bugs” – and therefore also important for the security of your software.

These are mostly “trying to do the right thing but doing it wrong” type of errors. Let’s now turn to the lack of competence side of the equation. Developers have often been trained in complex problem solving – but not necessarily in protecting software from abuse. Secure coding practices, such as how to avoid SQL injection, why you need output escaping and similar types of practical application security knowledge, are often not gained by studying computer science. It is also likely that a more self-taught individual would have skipped over such challenges, as the natural focus is on “solving the problem at hand”. This is why a secure coding practice must deliberately be created within an organization, and training and resources provided to teams to make it work. A good baseline should include:

  • How to protect against OWASP Top 10 type vulnerabilities
  • Secrets management: how to protect secrets in development and production
  • Detectability of cyber threats: application logging practices

An organization with a plan for this and appropriate training to make sure everyone’s on the same page, will stand a much better chance of avoiding the “competence gap” type errors.

Security testing in the build pipeline

OK, so you have planned your software, created a threat model, committed code. The CI/CD build pipeline triggers. What’s there to stop bad code from reaching your production environment? Let’s consider the potential locations of exploitable bugs in our product:

  • My code
  • The libraries used in that code
  • The environment where my software runs (typically a container in today’s world)

Obviously, if we are trying to push something with known critical errors in any of those locations to production, our pipeline should not accept that. Starting with our own code, a standard test that can uncover many bugs is “static analysis”. Depending on the rules you use, this can be a very good security control but it has limitations. Typically it will find a hardcoded password written as

var password = 'very_secret_password';

but it may not find this password if it isn’t a little bit smart:

var tempstring = 'something_that_may_be_just_a_string';

and yet it may throw an alert on

var password = getsecret();

just because the word “password” is in there. So using the right rules, and tuning them, is important to make this work. Static analysis should be a minimum test to always include.
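To make the tuning point concrete, here is a toy comparison of two rules run over the three lines above. Real static analysis tools do far more than regular expressions; `NAIVE` and `TUNED` are made-up rules meant only to show how a rule choice changes the findings.

```python
import re

SNIPPETS = [
    "var password = 'very_secret_password';",
    "var tempstring = 'something_that_may_be_just_a_string';",
    "var password = getsecret();",
]

# Naive rule: any assignment to a name containing "password".
NAIVE = re.compile(r"\b\w*password\w*\s*=", re.IGNORECASE)

# Tuned rule: the name looks secret AND the value is a string literal.
TUNED = re.compile(
    r"\b\w*(password|secret|token)\w*\s*=\s*(['\"])[^'\"]+\2",
    re.IGNORECASE,
)

for line in SNIPPETS:
    print(bool(NAIVE.search(line)), bool(TUNED.search(line)), line)
```

The tuned rule drops the false alarm on the getsecret() call, but neither rule finds the secret hiding in tempstring. Catching that one takes a different kind of check, and that is the limitation to keep in mind.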

The next part is our dependencies. Using libraries with known vulnerabilities is a common problem that makes life easy for the adversary. This is why you should always scan the code for external libraries and check if there are known vulnerabilities in them. Commercial vendors of such tools often refer to it as “software composition analysis”. The primary function is to list all dependencies, check them against databases of known vulnerabilities, create alerts accordingly, and break the build process based on threshold limits.
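Stripped of everything a real tool adds, the core of such a check is a lookup of the inventory against a list of advisories. The package names, versions and scores below are invented, and the version parsing is deliberately simplistic.

```python
# Hypothetical advisory data: package -> list of (first fixed version, CVSS score).
ADVISORIES = {
    "examplelib": [((2, 4, 1), 9.8)],
    "otherlib": [((1, 0, 5), 4.3)],
}

# Hypothetical inventory, as a lock file might list it.
INVENTORY = {"examplelib": "2.3.0", "otherlib": "1.0.2", "safelib": "3.1.0"}


def parse(version: str) -> tuple:
    """Turn '2.3.0' into (2, 3, 0); real version schemes need a real parser."""
    return tuple(int(part) for part in version.split("."))


def findings(inventory, advisories):
    """List (package, version, cvss) for every dependency below a fixed version."""
    found = []
    for name, version in sorted(inventory.items()):
        for fixed_in, cvss in advisories.get(name, []):
            if parse(version) < fixed_in:
                found.append((name, version, cvss))
    return found


def build_passes(found, cvss_limit=7.0):
    """Break the build if any finding reaches the CVSS limit."""
    return all(cvss < cvss_limit for _, _, cvss in found)


found = findings(INVENTORY, ADVISORIES)
for item in found:
    print(item)
print("build passes:", build_passes(found))
```

Because the check only needs the inventory and the advisory list, the same function can run on a schedule against what is already deployed, not just at build time.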

The environment we run on should also be secure. When building a container image, make sure it does not contain known vulnerabilities. Using a scanner tool for this is also a good idea.

While static analysis is primarily a build step, testing for known vulnerabilities, whether in code libraries or in the environment, should be done regularly to avoid vulnerabilities discovered after the code is deployed from remaining in production over time. Testing the inventory of dependencies against a database of known vulnerabilities regularly would be an effective control for this type of risk.

If a library or a dependency in the environment has been injected with malicious code in the supply chain, a simple scan will not identify it. Supply chain risk management is required to keep this type of threat under control, and there are no known trustworthy methods of automatically identifying maliciously injected code in third-party dependencies in the pipeline. One principle that should be followed with respect to this type of threat, however, is minimization of the attack surface. Avoid very deep dependency trees – like an NPM project with 25000 dependencies made by 21000 different contributors. Trusting 21000 strangers in your project can be a hard sell.

Another test that should preferably be part of the pipeline, is dynamic testing where actual payloads are tested against injection points. This will typically uncover other vulnerabilities than static analysis will and is thus a good addition. Note that active scanning can take down infrastructure or cause unforeseen errors, so it is a good idea to test against a staging/test environment, and not against production infrastructure.

Finally – we have the tests that will validate the mitigations identified during threat modeling. Unit tests and integration tests for security controls should be added to the pipeline.

Modern environments are usually defined in YAML files (or other types of config files), not by technicians drawing cables. The benefit of this is that the configuration can be easily tested. It is therefore a good idea to create acceptance tests for your Dockerfiles, Helm charts and other configuration files, to prevent an insider from altering them unnoticed, or a mistake from setting things up to be vulnerable.
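An acceptance test for a Dockerfile can be as small as the sketch below. The two rules (pinned base image, non-root final user) and the function name `dockerfile_problems` are examples I picked for illustration, and the parsing ignores corner cases such as line continuations and build arguments.

```python
def dockerfile_problems(text: str) -> list:
    """Return policy violations found in a Dockerfile (illustrative rules only)."""
    problems = []
    user = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        instruction, _, rest = line.partition(" ")
        instruction = instruction.upper()
        if instruction == "FROM":
            # Skip flags such as --platform=... to find the image reference.
            image = next(tok for tok in rest.split() if not tok.startswith("--"))
            name = image.rsplit("/", 1)[-1]
            # An untagged image or ":latest" makes the build unrepeatable.
            if "@" not in image and (":" not in name or name.endswith(":latest")):
                problems.append(f"unpinned base image: {image}")
            user = None  # assume root until a USER instruction says otherwise
        elif instruction == "USER":
            user = rest.strip()
    if user in (None, "root", "0"):
        problems.append("final stage runs as root")
    return problems


BAD = """\
FROM python:latest
COPY . /app
CMD ["python", "/app/main.py"]
"""

GOOD = """\
FROM python:3.12-slim
RUN useradd --create-home app
USER app
COPY . /app
CMD ["python", "/app/main.py"]
"""

print(dockerfile_problems(BAD))
print(dockerfile_problems(GOOD))
```

Run as a pipeline step, a check like this fails the build when someone changes the base image or drops the USER line, whether by mistake or on purpose.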

Security debt has a high interest rate

Technical debt is a curious beast: if you fail to address it, it will compound and likely ruin your project. The worst kind is security debt: whereas not fixing performance issues, not removing dead code and so on compounds like a credit card from your bank, leaving vulnerabilities in the code compounds like interest on money you borrowed from Raymond Reddington. Manage your debt, or you will go out of business because of a ransomware campaign followed by a GDPR fine and some interesting media coverage…

You need to plan for time to pay off your technical debt, in particular your security debt.

Say you want to plan using a certain percentage of your time in a sprint on fixing technical debt, how do you choose which issues to take? I suggest you create a simple prioritization system:

  • Exposed before internal
  • Easy to exploit before hard
  • High impact before low impact
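One way to read those three rules is as a strict ordering, where exposure always wins over ease of exploitation, which in turn wins over impact. That reading, the `Issue` fields and the impact scale are my assumptions in this sketch.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    title: str
    exposed: bool          # reachable from the internet?
    easy_to_exploit: bool
    impact: int            # 1 (low) to 3 (high), a hypothetical scale


def priority(issue: Issue):
    """Sort key: exposed first, then easy to exploit, then high impact."""
    return (not issue.exposed, not issue.easy_to_exploit, -issue.impact)


backlog = [
    Issue("Outdated TLS library on internal batch server", False, False, 3),
    Issue("SQL injection in public search form", True, True, 3),
    Issue("Verbose error pages on public API", True, True, 1),
    Issue("Weak admin password policy on intranet tool", False, True, 2),
]

for issue in sorted(backlog, key=priority):
    print(issue.title)
```

A strict ordering is crude: here a low-impact exposed issue is picked before a high-impact internal one. If that feels wrong for your system, a weighted score is an easy replacement for the sort key.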

But no matter what method you use to prioritize, the most important thing is that you work on getting rid of known vulnerabilities as part of “business-as-usual”, to avoid going bankrupt due to overwhelming technical debt. Or being hacked.

Sometimes the action you need to take to get rid of a security hole can create other problems. Like installing an update that is not compatible with your code. When this is the case, you may need to spend more resources on it than a “normal” vulnerability because you need to do code rewrites – and that refactoring may also need you to update your threat model and risk mitigations.

Operations: your code on the battlefield

In production your code is exposed to its users, and in part it may also be exposed to the internet as a whole. Dealing with feedback from this jungle should be seen as a key part of your vulnerability management program.

First of all, you will get access to logs and feedback from operations, whether it is performance related, bug detections or security incidents. It is important that you feed this into your issue management system and deal with it throughout sprints. Sometimes you may even have a critical situation requiring you to push a “hotfix” – a change to the code as fast as possible. The good thing about a good pipeline is that your hotfix will still go through basic security testing. Hopefully, your agile security process and your CI/CD pipeline are now working so well in symbiosis that they don’t slow your hotfix down. In other words: the “hotfix” you are pushing is just a code commit like all others – you are pushing to production several times a day, so how would this be any different?

Another aspect is feedback from incident response. There are two levels of incident response feedback that we should consider:

  1. Incident containment/eradication leading to hotfixes.
  2. Security improvements from the lessons learned stage of incident response

The first part we have already considered. The second part could be improvements to detections, better logging, etc. These should go into the product backlog and be handled during the normal sprints. Don’t let lessons learned end up as a PowerPoint given to a manager – a real lesson learned ends up as a change in your code, your environment, your documentation, or in the incident response procedures themselves.

Key takeaways

This was a long post. Here are the key practices to take away from it!

  • Remember that vulnerabilities come from poor operational practices, flaws in design/architecture, and from bugs (implementation errors). Linting only helps with bugs.
  • Use threat modeling to identify operational and design weaknesses
  • All errors are human errors. A good working environment helps reduce vulnerabilities (see performance shaping factors).
  • Validate mitigations using unit tests and integration tests.
  • Test your code in your pipeline.
  • Pay off technical debt religiously.

Vacation’s over. The internet is still a dumpster fire.

This has been the first week back at work after 3 weeks of vacation. Vacation was mostly spent playing with the kids, relaxing on the beach and building a garden fence. Then Monday morning came and reality came back, demanding a solid dose of coffee.

  • Wave of phishing attacks. One of those led to a lightweight investigation finding the phishing site set up for credential capture on a hacked WordPress site (as usual). This time the hacked site was a Malaysian site set up to sell testosterone and doping products… and digging around on that site, a colleague of mine found the hackers’ uploaded webshell. A gem with lots of hacking batteries included.
  • Next task: due diligence of a SaaS vendor, testing password reset. Found out they are using Base64 encoded user IDs as “random tokens” for password reset – meaning it is possible to reset the password for any user. The vendor has been notified (they are hopefully working on it).
  • Surfing Facebook, there’s an ad for a productivity tool. Curious as I am I create an account, and by habit I try to set a very weak password (12345). The app accepts this. Logging in to a fancy app, I can then by forced browsing look at the data from all users. No authorization checks. And btw, there is no way to change your password, or reset it if you forget. This is a commercial product. Don’t forget to do some due diligence, people.

Phishing for credentials?

Phishing is a hacker’s workhorse, and for compromising an enterprise it is by far the most effective tool, especially if the enterprise is not using two-factor authentication. Phishing campaigns tend to come in bursts, and this needs to be handled by helpdesk or some other IT team. With all the spam filters in the world, and regular awareness training, you can reduce the number of compromised accounts, but some phishing email is still going to succeed every single time. This is why the right solution is not to think that you can stop every malicious email or train every user to always be vigilant – the solution is primarily: multifactor authentication. Sure, it is possible to bypass many forms of it, but it is far more difficult to do than to just steal a username and a password.

Another good idea is to use a password manager. It will not offer to fill in passwords on sites that aren’t actually on the domain they pretend to be.

To secure against phishing, don’t rely on awareness training and spam filters only. Turn on 2FA and use a password manager for all passwords. #infosec

You do have a single sign-on solution, right?

Password reset gone wrong

The password reset thing was interesting. First, I registered an account on this app with a Mailinator email address and the password “passw0rd”. Promising… Then I tried the “I forgot” link on the login page to see if the password recovery flow was broken – and it really was, in a very obvious way. Password reset links are typically sent by email. Here’s how it should work:

You are sent a one-time link to recover your password. The link should contain an unguessable token and should be disabled once clicked. The link should also expire after a certain time, for example one hour.
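A minimal sketch of those three properties (unguessable, single-use, expiring) could look like the following. The in-memory dictionary and the function names are assumptions for illustration only; a real application would keep this in its database and send the token by email.

```python
import hashlib
import secrets
from datetime import datetime, timedelta, timezone
from typing import Optional

TOKEN_LIFETIME = timedelta(hours=1)  # the "for example one hour" from above

# Hypothetical store: SHA-256 of token -> (user_id, expiry time)
_pending_resets = {}


def _digest(token: str) -> str:
    # Store only a hash, so a leaked table does not reveal usable links
    return hashlib.sha256(token.encode("ascii")).hexdigest()


def issue_reset_token(user_id: int, now: Optional[datetime] = None) -> str:
    now = now or datetime.now(timezone.utc)
    token = secrets.token_urlsafe(32)  # 32 random bytes from the OS CSPRNG
    _pending_resets[_digest(token)] = (user_id, now + TOKEN_LIFETIME)
    return token


def redeem_reset_token(token: str, now: Optional[datetime] = None) -> Optional[int]:
    now = now or datetime.now(timezone.utc)
    # pop() removes the entry, so the link is disabled after the first use
    entry = _pending_resets.pop(_digest(token), None)
    if entry is None:
        return None
    user_id, expires_at = entry
    return user_id if now <= expires_at else None
```

The important design choice is that the token carries no information about the user; it is only a random key for looking up a server-side record.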

This one sent a link that did not expire, and that would work several times in a row. And the unguessable token? Looked something like this: “MTAxMjM0”. Hm… that’s too short to really be a random sequence worth anything at all. Trying to identify if this is a hash or something encoded, the first thing we try is to decode from Base64 – and behold – we get a 6-digit number (101234 in this case, not the userID from this app). Creating a new account, and then doing the same reveals we get the next number (like 101235). In other words, using the reset link of the type /password/iforgot/token/MTAxMjM0, we can simply Base64 encode a sequence of numbers and reset the passwords for every user.
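The decoding step is easy to reproduce with the standard library. This small sketch uses the example values from the text; the helper names are my own.

```python
import base64


def decode_token(token: str) -> int:
    """Base64-decode a reset token and read it as a number."""
    return int(base64.b64decode(token).decode("ascii"))


def encode_token(number: int) -> str:
    """Build the token the same way the app appears to do it."""
    return base64.b64encode(str(number).encode("ascii")).decode("ascii")


print(decode_token("MTAxMjM0"))  # 101234
print(encode_token(101235))      # MTAxMjM1
```

Base64 is an encoding, not encryption: anyone can reverse it, so a sequential number wrapped in Base64 is just as guessable as the number itself.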

Was this a hobbyist app made by a hobbyist developer? No, it is an enterprise app used by big firms. Does it contain personal data? Oh, yes. They have been notified, and I’m waiting for feedback from them on how soon they will have deployed a fix.

Broken access control

The case with the non-random random reset token is an example of broken authentication. But before the week is over we also need an example of broken access control. Another web app, another dumpster fire. This was a post shared on social media that looked like an interesting product. I created an account. Password this time: 12345. It worked. Of course it did…

This time there is no password reset function to test, but I suspect if there had been one it wouldn’t have been better than the one just described above.

This app had a forced browsing vulnerability. It was a project tracking app. Logging in, and creating a project, I got a URL of the following kind: /project/52/dashboard. I changed 52 to 25 – and found the project goals of somebody planning an event in Brazil. With budgets and all. The developer has been notified.
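The missing piece is an authorization check on every object lookup, not only a login check. A minimal sketch, with a made-up in-memory data store and a simple "owner only" rule as assumptions:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    project_id: int
    owner_id: int
    title: str


# Hypothetical data store standing in for the app's database
PROJECTS = {
    25: Project(25, owner_id=7, title="Event planning"),
    52: Project(52, owner_id=9, title="My test project"),
}


def get_dashboard(current_user_id: int, project_id: int) -> Project:
    """Return the project only if the logged-in user owns it."""
    project = PROJECTS.get(project_id)
    # Same error for "does not exist" and "not yours", so IDs cannot be probed
    if project is None or project.owner_id != current_user_id:
        raise PermissionError("project not found or access denied")
    return project
```

With this check in place, user 9 can open /project/52/dashboard, but changing 52 to 25 raises an error instead of showing somebody else's budget.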

Always check the security of the apps you would like to use. And always turn on maximum security on authentication (use a password manager, use 2FA everywhere). Don’t get pwnd. #infosec

CCSK Domain 1: Cloud Computing Concepts and Architecture

Recently I participated in a one-day class on the contents required for the “Certificate of Cloud Security Knowledge” held by Peter HJ van Eijk in Trondheim as part of the conference Sikkerhet og Sårbarhet 2019 (translates from Norwegian to: Security and Vulnerability 2019). The one-day workshop was interesting and the instructor was good at creating interactive discussions – making it much better than the typical PowerPoint overdose of commercial professional training sessions. There is a certification exam that I have not yet taken, and I decided I should document my notes on my blog; perhaps others can find some use for them too.

The CCSK exam closely follows a document made by the Cloud Security Alliance (CSA) called “CSA Security Guidance for Critical Areas of Focus in Cloud Computing v4.0” – a document you can download for free from the CSA webpage. They also lean on ENISA’s “Cloud Computing Risk Assessment”, which is also a free download.

Cloud computing isn’t about who owns the compute resources (someone else’s computer) – it is about providing scale and cost benefits through rapid elasticity, self-service, shared resource pools and a shared security responsibility model.

The way I’ll do these blog posts is that I’ll first share my notes, and then give a quick comment on what the whole thing means from my point of view (which may not really be that relevant to the CCSK exam if you came here for a shortcut to that).

Introduction to D1 (Cloud Concepts and Architecture)

Domain 1 contains 4 sections:  

  • Defining cloud computing 
  • The cloud logical model 
  • Cloud conceptual, architectural and reference model 
  • Cloud security and compliance scope, responsibilities and models 

NIST definition of cloud computing: a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction. 

A Cloud User is the person or organization requesting computational resources. The Cloud Provider is the person or organization offering the resources. 

Key techniques to create a cloud:  

  • Abstraction: we abstract resources from the underlying infrastructure to create resource pools  
  • Orchestration: coordination of delivering resources out of the pool on demand.  

Clouds are multitenant by nature. Consumers are segregated and isolated but share resource pools.  

Cloud computing models 

The foundation model of cloud computing of the CSA is the NIST model. A more in-depth model used as a reference model is taken from ISO/IEC.  The guidance talks mostly about the NIST model and doesn’t dive into the ISO/IEC model, which probably is sufficient for most definition needs.

Cloud computing has 5 characteristics:

  1. Shared resource pool (compute resources in a pool that consumers can pull from)
  2. Rapid elasticity (can scale up and down quickly)
  3. Broad network access
  4. On-demand self-service (management plane, API’s)
  5. Measured service (pay-as-you-go)

Cloud computing has 3 service models

  • Software as a Service (SaaS): like Cybehave or Salesforce
  • Platform as a Service (PaaS): like WordPress or AWS Elastic Beanstalk
  • Infrastructure as a Service (IaaS): like VM’s running in Google Cloud

Cloud computing has 4 deployment models:

  • Public Cloud: pool shared by anyone
  • Private Cloud: pool shared within an organization
  • Hybrid Cloud: connection between two clouds, commonly used when an on-prem datacenter connects to a public cloud
  • Community Cloud: pool shared by a community, for example insurance companies that have formed some form of consortium

Models for discussing cloud security

The CSA document discusses multiple model types in a somewhat incoherent manner. The types of models it mentions can be categorized as follows:

  • Conceptual models: descriptions to explain concepts, such as the logic model from CSA.  
  • Controls models: like CCM 
  • Reference architectures: templates for implementing security controls 
  • Design patterns: solutions to particular problems 

The document also outlines a simple cloud security process model 

  • Identify security and compliance requirements, and existing controls 
  • Select provider, service and deployment models 
  • Define the architecture 
  • Assess the security controls 
  • Identify control gaps 
  • Design and implement controls to fill gaps 
  • Manage changes over time 

The CSA logic model

This model explains 4 “layers” of a cloud environment and introduces some “funny words”:

  • Infrastructure: the core components in computing infrastructure, such as servers, storage and networks 
  • Metastructure: protocols and mechanisms providing connections between infrastructure and the other layers 
  • Infostructure: The data and information (database records, file storage, etc) 
  • Applistructure: The applications deployed in the cloud and the underlying applications used to build them. 

The key difference between traditional IT and cloud is the metastructure. Cloud metastructure contains the management plane components.  

Another key feature of cloud is that each layer tends to double. For example infrastructure is managed by the cloud provider, but the cloud consumer will establish a virtual infrastructure that will also need to be managed (at least in the case of IaaS). 

Cloud security scope and responsibilities 

The responsibility for security domains maps to the access the different stakeholders have to each layer in the architecture stack.  

  • SaaS: cloud provider is responsible for perimeter, logging, and application security and the consumer may only have access to provision users and manage entitlements 
  • PaaS: the provider is typically responsible for platform security and the consumer is responsible for the security of the solutions deployed on the platform. Configuring the offered security features is often left to the consumer.  
  • IaaS: cloud provider is responsible for hypervisors, host OS, hardware and facilities, consumer for guest OS and up in the stack.  

Shared responsibility model leaves us with two focus areas:  

  • Cloud providers should clearly document internal security management and security controls available to consumers.  
  • Consumers should create a responsibility matrix to make sure controls are followed up by one of the parties 
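A responsibility matrix does not need to be fancy. As a rough sketch, assuming an IaaS deployment and a control list of my own choosing (the last entry is a deliberately unassigned, hypothetical control), it can be a simple mapping that is checked for gaps:

```python
from typing import Dict, List, Optional

# Hypothetical IaaS responsibility matrix: control -> owning party
iaas_matrix: Dict[str, Optional[str]] = {
    "facilities": "provider",
    "hardware": "provider",
    "hypervisor": "provider",
    "host OS": "provider",
    "guest OS": "consumer",
    "application security": "consumer",
    "log monitoring": None,  # nobody has taken this one yet
}


def unowned_controls(matrix: Dict[str, Optional[str]]) -> List[str]:
    """List controls that neither party has been assigned to follow up."""
    return sorted(
        control
        for control, owner in matrix.items()
        if owner not in ("provider", "consumer")
    )


print(unowned_controls(iaas_matrix))  # ['log monitoring']
```

The point of the exercise is the last line: any control without an owner is a gap that should be closed in the contract or by the consumer's own team.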

Two compliance tools exist from the CSA and are recommended for mapping security controls:  

  • The Consensus Assessment Initiative Questionnaire (CAIQ) 
  • The Cloud Controls Matrix (CCM) 

#2cents

This domain is introductory and provides some terminology for discussing cloud computing. The key aspects from a risk management point of view are:

  • Cloud creates new risks that need to be managed, especially as it introduces more companies involved in maintaining security of the full stack compared to a full in-house managed stack. Requirements, contracts and audits become important tools.
  • The NIST model is more or less universally used in cloud discussions in practice. The service models are known to most IT practitioners, at least on the operations side.
  • The CSA guidance correctly designates the “metastructure” as the new kid on the block. The practical incarnation of this is API’s and console access (e.g. gcloud at API level and Google Cloud Console on “management plane” level). From a security point of view this means that maintaining security of local control libraries becomes very important, as well as identity and access management for the control plane in general.

In addition to the “who does what” problem that can occur with a shared security model, the self-service and fast-scaling properties of cloud computing often lead to “new and shiny” being pushed faster than security is aware of. An often overlooked part of “pushing security left” is that we also need to push both knowledge and accountability together with the ability to access the management plane (or parts of it through API’s or the cloud management console).

Does cyber insurance make sense?

Insurance relies on pooled risk; when a business is exposed to a risk it feels is not manageable with internal controls, the risk can be deferred to the capital markets through an insurance contract. For events that are unlikely to hit a very large number of insurance customers at once, this model makes sense. The pooled risk allows the insurer to create capital gains on the premiums paid by the customers, and the customers get their financial losses covered in case of a claim situation. This works very well for many cases, but insurers will naturally try to limit their liabilities, through “omissions clauses”; things that are not covered by the insurance policy. The omissions will typically include catastrophic systemic events that the insurance pool would not have the means to cover because a large number of customers would be hit simultaneously. It will also include conditions with the individual customer causing the insurance coverage to be voided – often referred to as pre-existing conditions. A typical example of the former is damages due to acts of war, or natural disasters. For these events, the insured would have to buy extra coverage, if at all offered. An example of the latter omission type, the pre-existing condition, would be diseases the insured would have before entering into a health insurance contract.

Risk pooling works well for protecting the solvency of insurers when issuing policies covering rare independent events with high individual impact – but is harder to design for covering events where there is systemic risk involved. Should insurers cover the effects of large-scale virus infections like WannaCry over normal cyber-insurance policies? Can they? What would the financial aggregate cost of a coordinated cyber-attack be on a society when major functions collapse – such as the Petya case in the Ukraine? Can insurers cover the cost of such events?

How does this translate into cyber insurance? There are several interesting aspects to think about, in both omissions categories. Let us start with the systemic risk – what happens to the insurance pool if all customers issue claims simultaneously? Each claim typically exceeds the premiums paid by any one single customer. Therefore, a cyberattack that spreads to large portions of the internet is hard to insure while keeping the insurer’s risk under control. Take for example the WannaCry ransomware attack in May; within a day more than 200,000 computers in 150 countries were infected. The Petya attack following in June caused similar reactions but the number of infected computers is estimated to be much lower. While WannaCry still looks like a poorly implemented cybercrime campaign intended to make money for the attacker, the Petya ransomware seems to have been a targeted cyberweapon used against the Ukraine; the rest was collateral damage, most likely. But for Ukrainian companies, the government and computer users this was a major attack: it took down systems belonging to critical infrastructure providers, it halted airport operations, it affected the government, it took hold of payment terminals in stores; the attack was a major threat to the entire society. What could a local insurer have done if it had covered most of those entities against any and all cyberattacks? It would most likely not have been able to pay out, and would have gone bankrupt.
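A bit of arithmetic shows why simultaneous claims break the pool. All numbers below are made up for illustration; they are not taken from any real insurer.

```python
def pool_balance(customers: int, premium: int, claims: int, claim_size: int) -> int:
    """Premiums collected minus claims paid out for one policy year."""
    return customers * premium - claims * claim_size


# Hypothetical pool: 1,000 customers paying 10,000 each, 500,000 per claim
CUSTOMERS = 1_000
PREMIUM = 10_000
CLAIM_SIZE = 500_000

# Independent incidents: 1% of the customers claim in a normal year
normal_year = pool_balance(CUSTOMERS, PREMIUM, claims=10, claim_size=CLAIM_SIZE)

# Systemic event: a worm hits 30% of the customers in the same week
outbreak_year = pool_balance(CUSTOMERS, PREMIUM, claims=300, claim_size=CLAIM_SIZE)

# Number of claims the premiums can cover before the pool is empty
break_even_claims = CUSTOMERS * PREMIUM // CLAIM_SIZE

print(normal_year)        # 5000000
print(outbreak_year)      # -140000000
print(break_even_claims)  # 20
```

In this toy pool the insurer is fine as long as no more than 2% of the customers claim in a year. A single outbreak that hits a third of them wipes out fourteen years of premiums.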

A case that came up in security forums after the WannaCry attack was “pre-existing condition” in cyber insurance. Many policies had included “human error” in the omissions clauses; basically, saying that you are not covered if you are breached through a phishing e-mail.  Some policies also include an “unpatched system” as an omission clause; if you have not patched, you are not covered. Not all policies are like that, and underwriters will typically gauge a firm’s cyber security maturity before entering into an insurance contract covering conditions that are heavily influenced by security culture. These are risks that are hard to include in quantitative risk calculations; the data are simply not there.

Insurance is definitely a reasonable control for mature companies, but there is little value in paying premiums if the business doesn’t satisfy the omissions clauses. For small businesses it will pay off to focus on the fundamentals first, and then to expand with a reasonable insurance policy.

For insurance companies it is important to find reasonable risk pooling measures to better cover large-scale infections like WannaCry. Because this is a serious threat to many businesses, not having meaningful insurance products in place will hamper economic growth overall. It is also important for insurers to get a grasp on individual omissions clauses – because in cyber risk management the thinking around “pre-existing condition” is flawed – security practice and culture is a dynamic and evolving thing, which means that the coverage omissions should be based on current states rather than a survey conducted prior to policy renewal.

 

Handling suppliers with low security awareness

Supply chain risk – in cyberspace

Cyber supply chain risk is a difficult area to manage. According to NIST 80% of all breaches originate in the supply chain, meaning it should be a definite priority of any security conscious organization to try and manage that risk. That number was given in a presentation by Jon Boyens at the 2016 RSA conference. A lot of big companies have been breached due to suppliers with poor information security practices, for example Target and Home Depot.

Your real attack surface includes the people you do business with – and those that they do business with again. And this is not all within your span of control!

Most companies do not have any form of cybersecurity screening of their suppliers. Considering the facts above, this seems like a very bad idea. Why is this so?

A lot of people think cybersecurity is difficult. The threat landscape itself is difficult to assess unless you have the tools and knowledge to do so. Most companies don’t have that sort of competence in-house, and they are often unaware that they are lacking know-how in a critical risk governance area.

Why are suppliers important when it comes to cybersecurity? The most important factor is that you trust your suppliers, and you may already have shared authentication secrets with them. Consider the following scenarios:

  1. Your HVAC service provider has VPN access to your network in order to troubleshoot the HVAC system in your office. What if hackers gain control over your HVAC vendor’s computer? Then they also have access to your network.
  2. A supplier that you frequently communicate with has been hacked. You receive an email from one of your contacts in this firm, asking if you can verify your customer information by logging into their web based self-service solution. What is the chance you would do that, provided the web page looks professional? You would at least click the link.
  3. You are discussing a contract proposal with a supplier. After emailing back and forth about the details for a couple of weeks he sends you a download link to proposed contract documents from his legal department. Do you click?

All of these are real use cases. All of them were successful for the cybercriminals wanting access to a bigger corporation’s network. The technical set-up was not exploited; in the HVAC case the login credentials of the supplier were stolen and abused (this was the Target attack, which exposed roughly 40 million payment cards and personal data on up to 70 million customers). In the other two cases an existing trust relationship was used to increase the credibility of a spear-phishing attack.

To counter social engineering, most companies offer “cybersecurity awareness training”. That can be helpful, and it can reduce how easy it is to trick employees into performing dangerous actions. When the criminals leverage an existing trust relationship, however, this kind of training is unlikely to have any effect. Further, your awareness training probably only covers your own organization. Through established buyer-supplier relationships the initial attack surface is not only your own organization; it is expanded to include all the organizations you do business with. And their attack surface, in turn, includes the people they do business with. This quickly expands to a very large network. You can obviously not manage the whole network – but what you can do is to evaluate the risk of using a particular supplier, and use that to determine which security controls to apply to the relationship with that supplier.

Screening the contextual risk of supplier organizations

What then determines the supplier risk level? Obviously internal affairs within the supplier’s organization are important, but at least in the early screening of potential suppliers this information is not available. The supplier may also be reluctant to reveal too much information about his or her company. This means you can only evaluate the external context of the supplier. As it turns out, there are several indicators you can use to gauge the likelihood of a supplier breach. Main factors include:

  • Main locations of the supplier’s operations, including corporate functions
  • The size of the company
  • The sector the company operates in

In addition to these factors, which can help determine how likely the organization is to be breached, you should consider what kind of information about your company the supplier would possess. Obviously, somebody with VPN login credentials to your network would be of more concern than a restaurant where you order overtime food for your employees. Of special concern should be suppliers or partners with access to critical business secrets, with login credentials, or with access to critical application programming interfaces.

Going back to the external context of the supplier; why is the location of the supplier’s operations important? It turns out that the number of malware campaigns a company is exposed to is strongly correlated with the political risk in the countries where the firm operates. Firms operating in countries with a high crime rate, significant corruption and dubious attitudes to democracy and freedom of speech, also tend to be attacked more from the outside. They are also more likely to have unlicensed software, e.g. pirated versions of Windows – leaving them more vulnerable to those attacks.

The size of the company is also an interesting indicator. Smaller companies, i.e. fewer than 250 employees, have a lower fraction of their incoming communication being malicious. At the same time, the defense of these companies is often weak; many of them lack processes for managing information security, and a lot of companies in this group do not have internal cybersecurity expertise.

The medium sized companies (250-500 employees) receive more malicious communications. These companies often lack formal cybersecurity programs too, and competence may also be missing here, especially on the process side of the equation. For example, few companies in this category have established an information security management system.

Larger companies still receive large amounts of malicious communications but they tend to have better defense systems, including management processes. Small and medium sized businesses therefore pose a higher threat for value chain exploitation than larger more established companies do.

Also, the sector the supplier operates in is a determining factor for the external context risk. Sectors that are particularly exposed to cyberattacks include:

  • Retail
  • Public sector and governmental agencies
  • Business services (consulting companies, lawyers, accountants, etc.)

Here the topic of “what information do I share” comes in. You are probably not very likely to share internal company data with a retailer unless you are part of the retailer’s supply chain. If you are, then you should be thinking about some controls, especially if the retailer is a small or medium sized business.

For many companies the “business services” category is of key interest. These are service providers that you would often share critical information with. Consulting companies gain access to strategic information and to your IT network, and get to know a lot of key stakeholders in your company. Lawyers would obviously have access to confidential information. Accountants would be trusted, have access to information and perhaps also to your ERP systems. Business service providers often get high levels of access, and they are often targeted by cybercriminals and other hackers; this is good reason to be vigilant with managing security in the buyer-supplier relationship.
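The three macro indicators can be combined into a coarse screening category. The sketch below is only one way to do it: the point values, thresholds and the low/medium/high political risk rating are my own assumptions for illustration, not an established scoring scheme.

```python
# Sectors named above as particularly exposed to cyberattacks
EXPOSED_SECTORS = {"retail", "public sector", "business services"}

# Hypothetical point values for the political risk where the supplier operates
POLITICAL_RISK_POINTS = {"low": 0, "medium": 1, "high": 2}


def contextual_risk(political_risk: str, employees: int, sector: str) -> str:
    """Coarse external-context risk category for a supplier."""
    points = POLITICAL_RISK_POINTS[political_risk]
    # Small and medium sized businesses (up to 500 employees) add a point
    points += 1 if employees <= 500 else 0
    # Operating in an exposed sector adds a point
    points += 1 if sector.lower() in EXPOSED_SECTORS else 0
    if points >= 3:
        return "high"
    if points == 2:
        return "medium"
    return "low"


print(contextual_risk("low", 5000, "manufacturing"))     # low
print(contextual_risk("low", 120, "business services"))  # medium
print(contextual_risk("high", 300, "retail"))            # high
```

The category only describes how likely the supplier is to be breached; what the supplier has access to in your organization still has to be weighed separately.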

Realistic assessments require up to date threat intelligence

There are more factors that come into play when selecting a supplier for your firm than security. Say you have an evaluation scheme that takes into account:

  • Financials
  • Capacity
  • Service level
  • And now… cybersecurity

If the risk of using a supplier is considered unreasonably high, you may end up selecting a different supplier that is more expensive, or that offers a lower level of service, than the “best” supplier with the high perceived risk. It is therefore important that the coarse contextual risk assessment is based on up-to-date threat models, even for the macro indicators discussed above.

Looking at historical data shows that the threat impact of company size remains relatively stable over time. Big companies tend to have better governance than small ones. On the positive side, smaller companies tend to be more interested in cooperating on risk governance than bigger players are. This, however, is usually not problematic when it comes to understanding the threat context.

Political risk is more volatile. Political changes in countries can happen quickly, and the effects of political change can be subtle but important for cybersecurity context. This factor depends on up to date threat intelligence, primarily available from open sources. This means that when you establish a contextual threat model, you should take care to keep it up to date with political risk factors that do change at least on a quarterly basis, and can even change abruptly in the case of revolutions, terror attacks or other major events causing social unrest. A slower stream would be legislative processes that affect how both businesses and governments deal with cyber threats. Key uncertainties in this field today include the access of intelligence organizations to communications data, and the evolution of privacy laws.

Also the sector influence on cyber threat levels does change dynamically. Here threat intelligence is not that easy to access but some open sources do exist. Open intel sources that can be taken into account to adjust the assessment of business sector risk are:

  • General business news and financial market trends
  • Threat intelligence reports from cybersecurity firms
  • Company annual reports
  • Regulations affecting the sector, as also mentioned under political risk
  • Vulnerability reports for business critical software important to each sector

In addition to this, less open sources of interest would be:

  • Contacts working within the sectors with access to trend data on cyber threats (e.g. sysadmins in key companies’ IT departments)
  • Sensors in key networks (often operated by government security organizations), sharing of information typically occurs in CERT organizations

Obviously, staying on top of the threat landscape is a challenging undertaking but failing to do so can lead to weak risk assessments that would influence business decisions the wrong way. Understanding the threat landscape is thus a business investment where the expected returns are long-term and hard to measure.

How to take action

How should you, as a purchaser, use this information about supplier threats? Consider the situation where you have access to a sound contextual threat model, and you are able to sort supplier companies into broad risk categories, e.g. low, medium and high risk. How can you use that information to improve your own risk governance and reduce the exposure to supply chain cyber threats?

First, you should establish a due diligence practice for cybersecurity. You should require more scrutiny for high-risk situations than low-risk ones. Here is one way to categorize the governance practices for supply chain cyber risks – but this is only a suggested structure. The actual activities should be adapted to your company’s needs and capabilities.

Practice | Low risk supplier | Medium risk supplier | High risk supplier
Require review of supplier’s policy for information security | No | Yes | Yes
State minimum supplier security requirements (antivirus, firewalls, updated software, training) | Yes | Yes | Yes
Require right to audit supplier for cybersecurity compliance | No | To be considered | Yes
Establish cooperation for incident handling | No | To be considered | Yes
Require external penetration test including social engineering prior to and during business relationship | No | No | To be considered
Agree on communication channels for security incidents related to buyer-supplier relationship | Yes | Yes | Yes
Require ISO 27001 or similar certification | No | No | To be considered
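If you keep supplier data in a system, the table above can be encoded directly so that the due diligence checklist follows from the risk category. This is a sketch of the suggested structure only; the shortened practice names and the function name are my own.

```python
LEVELS = ("low", "medium", "high")

# One row per practice: decision for (low, medium, high) risk suppliers
PRACTICES = {
    "Review supplier's information security policy": ("No", "Yes", "Yes"),
    "State minimum supplier security requirements": ("Yes", "Yes", "Yes"),
    "Right to audit supplier for cybersecurity compliance": ("No", "To be considered", "Yes"),
    "Cooperation for incident handling": ("No", "To be considered", "Yes"),
    "External penetration test incl. social engineering": ("No", "No", "To be considered"),
    "Agreed communication channels for security incidents": ("Yes", "Yes", "Yes"),
    "ISO 27001 or similar certification": ("No", "No", "To be considered"),
}


def practices_for(risk: str, decision: str = "Yes") -> list:
    """Practices with the given decision for a supplier risk category."""
    column = LEVELS.index(risk)
    return [name for name, row in PRACTICES.items() if row[column] == decision]


print(len(practices_for("low")))                  # 2
print(len(practices_for("high")))                 # 5
print(practices_for("medium", "To be considered"))
```

For a low risk supplier only the two baseline practices apply, while a high risk supplier gets five mandatory practices and two more to consider.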

If you found this post interesting, please share it with your contacts – and let me know what you think in the comments!

Why “secure iframes” on http sites are bad for security

Earlier this year it was reported that half of the web is now served over SSL (Wired.com). Still, quite a number of sites are trying to keep things in http, and to serve secure content in embedded parts of the site. There are two approaches to this:

  • A form embedded in an iframe served over https (not terrible but still a bad idea)
  • A form that loads over http and submits over https (this is terrible)

The form loading on the http site and submitting to a https site is security-wise meaningless because an attacker can read the data entered into the form on the web page. This means the security added by https is lost because a man-in-the-middle attacker on the http site can snoop on the data in the form directly.

Users are slowly but surely being trained to look for this padlock symbol and the “https” protocol when interacting with web pages and applications. 

The “secure iframe” is slightly better because the form is served over https and a man-in-the-middle cannot easily read the contents of the form. This is aided by iframe sandboxing in modern browsers (Chrome’s documentation has some info about this), although old ones may not be as secure because the sandboxing function was not included. Client-side restrictions can, however, be manipulated.

One of the big problems with security is lack of awareness about security risks. To counter this, browsers today indicate that login forms, payment forms, etc. on http sites are insecure. If you load your iframe over https on an http site, the browser will still warn the user (although the actual content is not submitted insecurely). This counteracts the learned (positive) behavior of looking for a green padlock symbol and the https protocol. Two potential bad effects:

  • Users start to ignore the unison cry of “only submit data when you see the green padlock” – which will be great for phishing agents and other scammers. This may be “good for business” in the short run, but it certainly is bad for society as a whole, and for your business in the long run.
  • Users will not trust your login form because it looks insecure and they choose not to trust your site – which is good for the internet and bad for your business.

Takeaways from this:

  • Serve all pages that interact with users in any form over https
  • Do not use mixed content in the same page. Just don’t do it.
  • While you are at it: don’t support weak ciphers and vulnerable crypto. That is also bad for karma, and good for criminals.
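The first two takeaways are easy to check automatically. Here is a minimal sketch using only the Python standard library; the list of watched tags is my own selection and far from complete, and anything that is not https (including data: URLs) is flagged.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class _ReferenceCollector(HTMLParser):
    # (tag, attribute) pairs that load a resource or receive submitted data
    WATCHED = {
        ("form", "action"),
        ("iframe", "src"),
        ("script", "src"),
        ("img", "src"),
        ("link", "href"),
    }

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if (tag, name) in self.WATCHED and value:
                # Resolve relative references against the page URL
                target = urljoin(self.page_url, value)
                if urlsplit(target).scheme != "https":
                    self.insecure.append((tag, target))


def find_insecure_references(page_url, html):
    """Return (kind, url) pairs for everything not served over https."""
    findings = []
    if urlsplit(page_url).scheme != "https":
        findings.append(("page", page_url))
    collector = _ReferenceCollector(page_url)
    collector.feed(html)
    return findings + collector.insecure


page = '<form action="https://pay.example.com/submit"><input name="card"></form>'
print(find_insecure_references("http://shop.example.com/checkout", page))
```

For this example the form action is fine, but the page itself is reported: a form that submits to https is still exposed when the page holding it was loaded over http.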

How do leaked cyber weapons change the threat landscape for businesses?

Recently, a group called Shadow Brokers released hundreds of megabytes of tools claimed to be stemming from the NSA and other intelligence organizations. Ars has written extensively on the subject: https://arstechnica.com/security/2017/04/nsa-leaking-shadow-brokers-just-dumped-its-most-damaging-release-yet/. The leaked code is available on github.com/misterch0c/shadowbroker. The exploits target several Microsoft products still in service (and commonly used), as well as the SWIFT banking network. Adding speculation to the case is the fact that Microsoft silently released patches to vulnerabilities claimed to be zerodays in the leaked code prior to the actual leak. But what does all of this mean for “the rest of us”?

Analysis shows that lifecycle management of software needs to be proactive, considering the security features of new products against the threat landscape prior to end-of-life for existing systems as a best practice. The threat from secondary adversaries may be increasing due to availability of new tools, and the intelligence agencies have also demonstrated willingness to target organizations in “friendly” countries; nation state actors should thus include domestic ones in threat modeling. 

There are two key questions we need to ask and try to answer:

  1. Should threat models include domestic nation state actors, including illegal use of intelligence capabilities against domestic targets?
  2. Does the availability of the leaked tools increase the threat from secondary actors, e.g. organized crime groups?

Taking on the first issue first: should we include domestic intelligence in threat models for “normal” businesses? Let us examine the C-I-A security triangle from this perspective.

  • Confidentiality: are domestic intelligence organizations interested in stealing intellectual property or obtaining intelligence on critical personnel within the firm? If you are not yourself the direct target, this tends to be supply chain driven, or data collection may occur because of innocent links to other organizations that are being targeted by the intelligence unit.
  • Integrity (data manipulation): if your supply chain is involved in activities drawing sufficient attention to require offensive operations, including non-cyber operations, integrity breaches are possible. Activities involving terrorism funding or illegal arms trade would increase the likelihood of such interest from authorities.
  • Availability: nation state actors are not the typical adversary that will use DoS-type attacks, unless it is to mask other intelligence activities by drawing response capabilities to the wrong front.

The probability of APT activities from domestic intelligence is still low for most firms. The primary sectors where this could be a concern are critical infrastructure and financial institutions. Firms involved in the value chains of illegal arms trade, funding of terrorism or human trafficking are also potential targets, but these firms are often not aware of their role in the illegal business streams of their suppliers and customers.

The second question was whether the leak poses an increased threat from other adversary types, such as organized crime groups. Organized crime groups run structured operations across multiple sectors, both legal and illegal. They tend to be opportunistic, and any new tools that can support their primary cybercrime activities will most likely be put to use quickly. The typical high-risk activities include credit card and payment fraud, document fraud and identity theft, illicit online trade including stolen intellectual property, and extortion schemes by direct blackmail or use of malware. The leaked tools can support several of these activities, including extortion schemes and information theft. This indicates that the risk level does in fact increase with the leaks of additional exploit packages.

How should we now apply this knowledge in our security governance?

  • The tools exploit flaws in older versions of operating systems. Keeping systems up-to-date remains crucial. New versions of Windows tend to come with improved security. Migration before the previous version reaches end-of-life should be considered.
  • In risk assessments, domestic intelligence should be considered together with foreign intelligence and proxy actors. Stakeholder and value chain links remain key drivers for this type of threat.
  • Organized crime: targeted threats are value chain driven. Firms with such exposure and old infrastructure most likely face increased risk, because new cyberweapons are now available to organized crime groups.
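
The first point, migrating before end-of-life, is easier to govern if the asset inventory is checked against support dates on a regular basis. Below is a minimal sketch of such a check. The `Host` class, the `classify` function, the one-year warning window and the inventory itself are hypothetical; the end-of-support dates in the table are included for illustration and should be verified against the vendor's lifecycle pages before use.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Host:
    name: str
    os_name: str


# End-of-support dates per OS. Illustrative values only: verify them
# against the vendor's published lifecycle information.
END_OF_SUPPORT = {
    "Windows XP": date(2014, 4, 8),
    "Windows Server 2003": date(2015, 7, 14),
    "Windows 7": date(2020, 1, 14),
}


def classify(host, today, warn_window=timedelta(days=365)):
    """Classify a host by where its OS is in the support lifecycle."""
    eol = END_OF_SUPPORT.get(host.os_name)
    if eol is None:
        return "unknown"          # not in our table: needs manual review
    if today > eol:
        return "unsupported"      # no more security patches
    if today + warn_window >= eol:
        return "plan-migration"   # end-of-life is inside the warning window
    return "supported"


if __name__ == "__main__":
    # Hypothetical inventory, evaluated on an arbitrary example date.
    inventory = [
        Host("hmi-01", "Windows XP"),
        Host("file-02", "Windows Server 2003"),
        Host("desk-17", "Windows 7"),
    ]
    today = date(2017, 4, 20)
    for host in inventory:
        print(host.name, host.os_name, classify(host, today))
```

Hosts that come out as "unsupported" are the ones most exposed to exploit packages targeting old operating systems, and "plan-migration" is the signal to start evaluating the replacement before the deadline rather than after it.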