DevSecOps: Embedded security in agile development

The way we write, deploy and maintain software has changed greatly over the years, from waterfall to agile, from monoliths to microservices, from the basement server room to the cloud. Yet, many organizations haven’t changed their security engineering practices – leading to vulnerabilities, data breaches and lots of unpleasantness. This blog post is a summary of my thoughts on how security should be integrated from user story through coding and testing and up and away into the cyber clouds. I’ve developed my thinking around this as my work in the area has moved from industrial control systems and safety critical software to cloud native applications in the “internet economy”.

What is the source of a vulnerability?

At the outset of this discussion, let’s clarify two common terms as I use them here. In very unacademic terms:

  • Vulnerability: a flaw in the way a system is designed and operated that allows an adversary to perform actions the system owner did not intend to make available.
  • Threat: actions performed on an asset in the system by an adversary in order to achieve an outcome that he or she is not supposed to be able to achieve.

The primary objective of security engineering is to stop adversaries from being able to achieve their evil deeds. Most often, evilness is possible because of system flaws. How these flaws end up in the system is important to understand when we want to make life harder for the adversary. Vulnerabilities are flaws, but not all flaws are vulnerabilities. Fortunately, quality management helps reduce defects whether they can be exploited by evil hackers or not. Let’s look at three types of vulnerabilities we should work to abolish:

  • Bugs: coding errors, implementation flaws. The design and architecture are sound, but the implementation is not. A typical example of this is a SQL injection vulnerability in a web app.
  • Design flaws: errors in architecture and how the system is planned to work. A flawed plan that is implemented perfectly can be very vulnerable. A typical example of this is a broken authorization scheme.
  • Operational flaws: the system makes it hard for users to do things correctly, making it easier to trick privileged users into performing actions they should not. An example would be a confusing permission system, where an adversary uses social engineering of customer support to gain privilege escalation.

Security touchpoints in a DevOps lifecycle

Traditionally there has been a lot of discussion about the secure development lifecycle. But our concern is removing vulnerabilities from the system as a whole, so we should follow the system from infancy through operations. The following touchpoints do not make up a blueprint; they are an overview of security aspects in different system phases.

  • Dev and test environment:
    • Dev environment helpers
    • Pipeline security automation
    • CI/CD security configuration
    • Metrics and build acceptance
    • Rigor vs agility
  • User roles and stories
    • Rights management
  • Architecture: data flow diagram
    • Threat modeling
    • Mitigation planning
    • Validation requirements
  • Sprint planning
    • User story reviews
    • Threat model refinement
    • Security validation testing
  • Coding
    • Secure coding practices
    • Logging for detection
    • Abuse case injection
  • Pipeline security testing
    • Dependency checks
    • Static analysis
    • Mitigation testing
      • Unit and integration testing
      • Detectability
    • Dynamic analysis
    • Build configuration auditing
  • Security debt management
    • Vulnerability prioritization
    • Workload planning
    • Compatibility blockers
  • Runtime monitoring
    • Feedback from ops
    • Production vulnerability identification
    • Hot fixes are normal
    • Incident response feedback

Dev environment aspects

If an adversary takes control of the development environment, he or she can likely inject malicious code in a project. Securing that environment becomes important. The first principle should be: do not use production data, configurations or servers in development. Make sure those are properly separated.

The developer workstation should also be properly hardened, as should any cloud accounts used during development, such as GitHub or a cloud based build pipeline: use two-factor authentication, patch regularly, don’t do everyday work on admin accounts, and encrypt network traffic.

The CI/CD pipeline should be configured securely: no hard-coded secrets, and strict limits on who can access the secrets that are needed. Control who can change the build configuration.

During the early phases of a project it is tempting to be relaxed with testing, dependency vulnerabilities and so on. This can quickly turn into technical debt – first in one service, then in many, and in the end there is no way to refinance your security debt at lower interest rates. Technical debt compounds like credit card debt – so manage it carefully from the beginning. To help with this, create acceptable build thresholds, and a policy on the lifetime of accepted poor metrics. Take metrics from your testing tools and let them guide you: complexity, code coverage, number of vulnerabilities with CVSS above X, etc. Don’t select too many KPIs, but don’t allow the ones you track to slip.
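As a minimal sketch of what such a policy gate could look like (the metric names, report format and thresholds here are invented for illustration, not taken from a real tool), a small script in the pipeline can compare tool output against the agreed thresholds and break the build when they slip:

# quality_gate.py – illustrative sketch; metric names, thresholds and the report
# format are assumptions, adapt them to the reports your own tools produce.
import json
import sys

THRESHOLDS = {
    "min_line_coverage": 0.70,        # from the unit test report
    "max_cvss": 7.0,                  # no known vulnerability at or above this score
    "max_cyclomatic_complexity": 15,  # worst single function, from static analysis
}

def violations(metrics: dict) -> list:
    """Return human-readable descriptions of any threshold violations."""
    problems = []
    if metrics["line_coverage"] < THRESHOLDS["min_line_coverage"]:
        problems.append(f"coverage {metrics['line_coverage']:.0%} is below the minimum")
    if metrics["worst_cvss"] >= THRESHOLDS["max_cvss"]:
        problems.append(f"a vulnerability with CVSS {metrics['worst_cvss']} is present")
    if metrics["worst_complexity"] > THRESHOLDS["max_cyclomatic_complexity"]:
        problems.append(f"a function has cyclomatic complexity {metrics['worst_complexity']}")
    return problems

if __name__ == "__main__":
    with open(sys.argv[1]) as f:          # aggregated report produced earlier in the pipeline
        metrics = json.load(f)
    problems = violations(metrics)
    for p in problems:
        print("BUILD POLICY VIOLATION:", p)
    sys.exit(1 if problems else 0)        # a non-zero exit code breaks the build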

One could argue that strict policies and acceptance criteria will hurt agility and slow a project down. Truth is that lack of rigor will come back to bite us, but at the same time too much will indeed slow us down or even turn our agility into a stale bureaucracy. Finding the right balance is important, and this should be informed by context. A system processing large amounts of sensitive personal information requires more formalism and governance than a system where a breach would have less severe consequences. One size does not fit all.

User roles and stories

Most systems have different types of users with different needs – and different access rights. Hackers love developers who don’t plan in terms of user roles and stories – the things each user would need to do with the system – because lack of planning often leads to much more liberal permissions “just in case”. User roles and stories should thus be a primary security tool. Consider a simple app for approval of travel expenses in a company. This app has two primary user types:

  • Travelling salesmen who need reimbursements
  • Bosses who will approve or reject reimbursement claims

In addition to this, someone must be able to add and remove users, grant the right travelling salesmen access to a given boss, and so on. In other words, the system also needs an administrator.

Let’s take the travelling salesman and look at “user stories” that this role would generate:

  • I need to enter my expenses into a report
  • I need to attach documentation such as receipts to this report
  • I need to be able to send the report to the boss for approval
  • I want to see the approval status of my expense report
  • I need to receive a notification if my report is not approved
  • I need to be able to correct any mistakes based on the rejection

Based on this, it is clear that the permissions of the “travelling salesman” role only need to give write access to a few operations, for data relating to this specific user, plus read rights on the status of the approval. This goes directly into our authorization concept for the app, and already at this point it generates testable security annotations:

  • A travelling salesman should not be able to read the expense report of another travelling salesman
  • A travelling salesman should not be able to approve expense reports, including his own

These negative unit tests could already go into the design as “security annotations” for the user stories.
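As a sketch of how these annotations can become executable checks (the api_client fixture, endpoints and report IDs are hypothetical, not taken from a real codebase), the negative tests could look something like this in a pytest-style suite:

# test_expense_authorization.py – illustrative sketch; api_client, endpoints and IDs are hypothetical.
def test_salesman_cannot_read_another_salesmans_report(api_client):
    api_client.login(user="alice", role="travelling_salesman")
    response = api_client.get("/reports/42")            # report 42 belongs to bob
    assert response.status_code == 403                  # access must be denied

def test_salesman_cannot_approve_expense_reports(api_client):
    api_client.login(user="alice", role="travelling_salesman")
    response = api_client.post("/reports/7/approve")    # report 7 is alice's own
    assert response.status_code == 403                  # not even her own report can be approved by her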

In addition to user stories, we have abusers and abuse stories. This is about the types of adversaries and what they would like to do – things we don’t want them to be able to achieve. Let’s take as an example a hacker hired by a competitor to perform industrial espionage. We have the adversary role “industrial espionage”. Here are some abuse cases we can define that relate to the motivation of this player rather than to technical vulnerabilities:

  • I want to access all travel reports to map where the sales personnel of the firm are going to see clients
  • I want to see the financial data approved to gauge the size of their travel budget, which would give me information on the size of their operation
  • I’d like to find names of people from their clients they have taken out to dinner, so we know who they are talking to at potential client companies
  • I’d like to get user names and personal data that allow me to gauge whether some of the employees could be recruited as insiders or poached to come work for us instead

How is this hypothetical information useful for someone designing an expense reporting app? By knowing the motivations of the adversaries, we can better gauge how credible it is that a certain type of vulnerability will actually be exploited. Remember: vulnerabilities are not the same as threats – and we have limited resources, so the vulnerabilities that would help attackers achieve their goals are more important to remove than those that cannot easily help the adversary.

Architecture and data flow diagrams

Coming back to the sources of vulnerabilities, we want to avoid vulnerabilities of three kinds: software bugs, software design flaws, and flaws in operating procedures. Bugs are implementation errors, and the way we try to avoid them is by managing competence, workload and stress levels, and by using automated security testing such as static analysis and similar tools. Experience from software reliability engineering shows that about 50% of software flaws are implementation errors – the rest would then be design flaws: designs and architectures that do not implement the intentions of the designer. Static analysis cannot help us here, because there may be no coding errors such as missing exception handling or missing input validation – it is the concept itself that is wrong; for example giving a user role too many privileges, or allowing a component to talk to another component it shouldn’t have access to.

A good tool for identification of such design flaws is threat modeling based on a data flow diagram. Make a diagram of the software data flow, break it down into components at a reasonable level, and consider how an adversary could attack each component and what the impact would be. By going through an exercise like this, you will likely identify potential vulnerabilities and weaknesses that you need to handle. The mitigations you introduce may be various security controls – such as blocking internet access for a server that only needs to be available on the internal network.

The next question then is: how do you validate that your controls work? Do you order a penetration test from a consulting company? That could work, but it doesn’t scale very well; you want this to work in your pipeline. The primary tools to turn to are unit and integration testing.

We will not discuss the techniques for threat modeling in depth in this post, but there are several that can be applied. Keep it practical, and don’t dive too deep into the details – it is better to start with a higher-level view of things and refine it as the design matures.

Often a STRIDE-like approach is a good start, and for the worst case scenarios it can be worthwhile diving into more detail with attack trees. An attack tree is a fault tree applied to adversarial modeling.

After the key threats have been identified, it is time to plan how to deal with that risk. We should apply the defense-in-depth principle and remember that a single security control is usually not enough to stop all attacks – because we do not know what all the possible attack patterns are. When we have come up with mitigations for the threats we worry about, we need to validate that they actually work. This validation should happen at the lowest possible level – unit tests and integration tests. It is a good idea for the developer to run his or her own tests, but these validations definitely must also live in the build pipeline.

Let’s consider an authentication flow using SMS-based two-factor authentication (2FA). This is the authentication for an application used by politicians, and there are skilled threat actors who would like to gain access to individual accounts.

A simple data flow diagram for a 2FA flow

Here’s how the authentication process works:

  • The user connects to the domain and gets a single-page application loaded in the browser, with a login form for username and password.
  • The user enters credentials, which are sent as a POST request to the API server, which validates them against stored credentials (hashed in a safe way) in a database. The API server only accepts requests from the right domain, and the DB server is not internet accessible.
  • When the correct credentials have been provided, the SPA updates with a 2FA challenge, and the API server sends a POST request to a third-party SMS gateway, which sends the token to the user’s cell phone.
  • The user enters the code and, if it is valid, is authenticated. A JWT is returned to the browser and stored in localStorage.

Let’s put on the dark hat and consider how we can take over this process.

  1. SIM card swapping combined with a phishing email to capture the credentials
  2. SIM card swapping combined with keylogger malware for password capture
  3. Phishing capturing both password and the second factor from a spoofed login page, and reusing credentials immediately
  4. Create an evil browser extension and trick the user to install it using social engineering. Use the browser extension to steal the token.
  5. Compromise a dependency used by the application’s frontend, to allow man-in-the-browser attacks that can steal the JWT after login.
  6. Compromise a dependency used in the API to give direct access to the API server and the database
  7. Compromise the third-party SMS gateway to capture the one-time codes, combined with a password captured through phishing or some other technique
  8. Exploit a vulnerability in the API to bypass authentication, either in a dependency or in the code itself.

As we see, the threat is the adversary getting access to a user account. There are many attack patterns that could be used, and only one of them involves only the code written in the application. If we are going to start planning mitigations here, we could first get rid of the first two problems by not using SMS for two-factor authentication, but rather relying on an authenticator app, like Google Authenticator. Test: no requests to the SMS gateway.

Phishing: avoid direct post requests from a phishing domain to the API server by only allowing CORS requests from our own domain. Send a verification email when a login is detected from an unknown machine. Tests: check that CORS from other domains fail, and check that an email is sent when a new login occurs.
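A sketch of how those two checks could look as automated tests against a staging environment (the staging URL, the test user and the mail_spy helper are made up for illustration):

# test_login_mitigations.py – illustrative sketch; staging URL and fixtures are hypothetical.
import requests

API = "https://staging.example.com/api"

def test_cors_preflight_rejects_foreign_origin():
    r = requests.options(API + "/login",
                         headers={"Origin": "https://evil.example.org",
                                  "Access-Control-Request-Method": "POST"})
    # the API must not echo a foreign origin back as an allowed origin
    assert r.headers.get("Access-Control-Allow-Origin") != "https://evil.example.org"

def test_login_from_new_device_triggers_verification_email(test_user, mail_spy):
    # test_user and mail_spy are hypothetical fixtures wrapping a seeded account and a test inbox
    test_user.login(device_fingerprint="previously-unseen-device")
    assert mail_spy.received(to=test_user.email, subject_contains="new login")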

Browser extensions: capture metadata/fingerprint data and detect token reuse across multiple machines. Test: same token in different browsers/machines should lead to detection and logout.

Compromised dependencies are a particularly difficult attack vector to deal with, as the vulnerability is typically unknown – in practice a zero-day. For token theft, the metadata-based mitigation is still valid. In addition, it is good practice to have a process for acceptance of third-party libraries that goes beyond checking for “known vulnerabilities”. Compromise of the third-party SMS gateway is also difficult to deal with in the software project and should be part of a supply chain risk management program – although in our case that particular problem is solved by removing the third party when we drop SMS-based 2FA.

Exploit a vulnerability in the app’s API: perform static analysis and dependency analysis to minimize known vulnerabilities. Test: no high-risk vulnerabilities detected with static analysis or dependency checks.

We see that in spite of having many risk reduction controls in place, we do not cover everything that we know, and there are guaranteed to be attack vectors in use that we do not know about.

Sprint planning – keeping the threat model alive

Sometimes “secure development” methodologies receive criticism for “being slow”. Too much analysis, the sprint stops, productivity drops. This is obviously not good, so the question is rather: how can we make security a natural part of the sprint? One answer to that, at least a partial one, is to have a threat model based on the overall architecture. When it is time for sprint planning, there are three essential pieces that should be revisited:

  • The user stories or story points we are addressing; do they introduce threats or points of attack not already accounted for?
  • Is the threat model we created still representative for what we are planning to implement? Take a look at the data flow diagram and see if anything has changed – if it has, evaluate if the threat model needs to be updated too.
  • Finally: for the threats relevant to the issues in the sprint backlog, do we have validation for the planned security controls?

Simply discussing these three issues will often be enough to see if there are more “known unknowns” that we need to take care of, and will allow us to update the backlog and test plan with the appropriate annotations and issues.

Coding: the mother of bugs after the design flaws have been agreed upon

The main purpose of the threat modeling discussed above is to uncover “design flaws”. While writing code, it is perfectly possible to implement a flawed plan in a flawless manner – that is why we should really invest a lot of effort in creating a plan that makes sense. The other half of the vulnerabilities are bugs – coding errors. As long as people are still writing the code, and not some very smart AI, errors in code will be related to human factors – or human error, as it is popularly called. This often points the finger of blame at a single individual (the developer), but since none of us work in a vacuum, there are many factors that influence these bugs. Let us try to classify these errors (leaning heavily on human factors research) – broadly there are three classes of human error:

  • Slips: errors made due to lack of attention, a mishap. Think of this like a typo; you know how to spell a word but you make a small mistake, perhaps because your mind is elsewhere or because the keyboard you are typing on is unfamiliar.
  • Competence gaps: you don’t really know how to do the thing you are trying to do, and this lack of knowledge and practice leads you to make the wrong choice. Think of an inexperienced vehicle driver on a slippery road in the dark of the night.
  • Malicious error injection: an insider writes bad code on purpose to hurt the company – for example because he or she is being blackmailed.

Let’s leave the evil programmer aside and focus on how to minimize bugs that are created due to other factors. Starting with “slips” – which factors would influence us to make such errors? Here are some:

  • Not enough practice to make the action to take “natural”
  • High levels of stress
  • Lack of sleep
  • Task overload: too many things going on at once
  • Outside disturbances (noise, people talking to you about other things)

It is not obvious that the typical open office plan favored by IT firms is the optimal layout for programmers. Workload management, work-life balance and physical working environment are important factors for avoiding such “random bugs” – and therefore also important for the security of your software.

These are mostly “trying to do the right thing but doing it wrong” types of errors. Let’s now turn to the lack-of-competence side of the equation. Developers have often been trained in complex problem solving – but not necessarily in protecting software from abuse. Secure coding practices – such as how to avoid SQL injection, why you need output escaping, and similar practical application security knowledge – are often not gained by studying computer science. It is also likely that a more self-taught individual will have skipped over such challenges, as the natural focus is on “solving the problem at hand”. This is why a secure coding practice must be deliberately created within an organization, with training and resources provided to teams to make it work. A good baseline should include:

  • How to protect against OWASP Top 10 type vulnerabilities
  • Secrets management: how to protect secrets in development and production
  • Detectability of cyber threats: application logging practices

An organization with a plan for this, and appropriate training to make sure everyone is on the same page, will stand a much better chance of avoiding the “competence gap” type of errors.

Security testing in the build pipeline

OK, so you have planned your software, created a threat model, committed code. The CI/CD build pipeline triggers. What’s there to stop bad code from reaching your production environment? Let’s consider the potential locations of exploitable bugs in our product:

  • My code
  • The libraries used in that code
  • The environment where my software runs (typically a container in today’s world)

Obviously, if we are trying to push something with known critical errors in any of those locations to production, our pipeline should not accept that. Starting with our own code, a standard test that can uncover many bugs is static analysis. Depending on the rules you use, this can be a very good security control, but it has limitations. Typically it will find a hardcoded password written as

var password = 'very_secret_password';

but it may miss the same secret when it is stored in a variable with an innocent name, unless the tool is a little bit smart:

var tempstring = 'something_that_may_be_just_a_string';

and yet it may throw an alert on

var password = getsecret();

just because the word “password” is in there. So using the right rules, and tuning them, is important to make this work. Static analysis should be a minimum test to always include.
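To make that limitation concrete, here is a toy version of such a rule (a naive sketch, not a real tool): it flags assignments to variables whose names look secret-related, which catches the first example above, misses the second, and complains about the harmless third one.

# naive_secret_rule.py – a toy illustration of why static analysis rules need tuning.
import re

SUSPECT_NAME = re.compile(r"password|passwd|secret|api_?key", re.IGNORECASE)
ASSIGNMENT = re.compile(r"(?:var|let|const)\s+(\w+)\s*=\s*(.+);")

def scan(line: str):
    match = ASSIGNMENT.search(line)
    if match and SUSPECT_NAME.search(match.group(1)):
        return f"possible hardcoded secret: {line.strip()}"
    return None

for example in [
    "var password = 'very_secret_password';",                    # flagged – true positive
    "var tempstring = 'something_that_may_be_just_a_string';",   # missed – false negative
    "var password = getsecret();",                               # flagged – false positive
]:
    print(scan(example))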

The next part is our dependencies. Using libraries with known vulnerabilities is a common problem that makes life easy for the adversary. This is why you should always scan the code for external libraries and check whether there are known vulnerabilities in them. Commercial vendors of such tools often refer to this as “software component analysis”. The primary function is to list all dependencies, check them against databases of known vulnerabilities, create alerts accordingly – and break the build process based on threshold limits.

The environment we run on should also be secure. When building a container image, make sure it does not contain known vulnerabilities. Using a scanner tool for this is also a good idea.

While static analysis is primarily a build step, testing for known vulnerabilities – whether in code libraries or in the environment – should be done regularly, to avoid vulnerabilities discovered after the code is deployed from remaining in production over time. Testing the inventory of dependencies against a database of known vulnerabilities on a regular schedule would be an effective control for this type of risk.

If a library or a dependency in the environment has been injected with malicious code in the supply chain, a simple scan will not identify it. Supply chain risk management is required to keep this type of threat under control, and there are no known trustworthy methods of automatically identifying maliciously injected code in third-party dependencies in the pipeline. One principle that should be followed with respect to this type of threat, however, is minimization of the attack surface. Avoid very deep dependency trees – like an NPM project with 25,000 dependencies made by 21,000 different contributors. Trusting 21,000 strangers in your project can be a hard sell.

Another test that should preferably be part of the pipeline, is dynamic testing where actual payloads are tested against injection points. This will typically uncover other vulnerabilities than static analysis will and is thus a good addition. Note that active scanning can take down infrastructure or cause unforeseen errors, so it is a good idea to test against a staging/test environment, and not against production infrastructure.

Finally – we have the tests that will validate the mitigations identified during threat modeling. Unit tests and integration tests for security controls should be added to the pipeline.

Modern environments are usually defined in YAML files (or other types of config files), not by technicians drawing cables. The benefit of this is that the configuration can easily be tested. It is therefore a good idea to create acceptance tests for your Dockerfiles, Helm charts and other configuration files, to prevent an insider from altering them – or someone from setting things up to be vulnerable by mistake.
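As a minimal sketch of such an acceptance test (the two rules shown are examples of a policy, not a complete one), a couple of pytest checks can parse the Dockerfile and fail the build if the container would run as root or pull an unpinned base image:

# test_dockerfile_policy.py – illustrative sketch of configuration acceptance tests.
from pathlib import Path

def dockerfile_lines(path="Dockerfile"):
    return [line.strip() for line in Path(path).read_text().splitlines() if line.strip()]

def test_container_does_not_run_as_root():
    user_lines = [l for l in dockerfile_lines() if l.upper().startswith("USER ")]
    assert user_lines, "the Dockerfile must declare a non-root USER"
    assert user_lines[-1].split()[1] not in ("root", "0")

def test_base_images_are_pinned():
    from_lines = [l for l in dockerfile_lines() if l.upper().startswith("FROM ")]
    assert from_lines, "the Dockerfile must have a FROM line"
    for l in from_lines:
        image = l.split()[1]
        assert ":" in image or "@" in image, f"unpinned base image: {image}"
        assert not image.endswith(":latest"), f"base image pinned to :latest: {image}"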

Security debt has a high interest rate

Technical debt is a curious beast: if you fail to address it, it will compound and likely ruin your project. The worst kind is security debt: whereas the debt from unfixed performance issues, dead code and so on compounds like credit card debt from your bank, leaving vulnerabilities in the code compounds like interest on money you borrowed from Raymond Reddington. Manage your debt, or you will go out of business on the back of a ransomware campaign followed by a GDPR fine and some interesting media coverage…

You need to plan for time to pay off your technical debt, in particular your security debt.

Say you plan to use a certain percentage of your time in each sprint on fixing technical debt – how do you choose which issues to take on? I suggest you create a simple prioritization system:

  • Exposed before internal
  • Easy to exploit before hard
  • High impact before low impact

But no matter what method you use to prioritize, the most important thing is that you work on getting rid of known vulnerabilities as part of business-as-usual – to avoid going bankrupt due to overwhelming technical debt. Or being hacked.
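A sketch of how that ordering could be encoded when picking issues from the backlog (the issue fields and example values are invented for illustration):

# prioritize_vulns.py – illustrative sketch; fields and example data are invented.
backlog = [
    {"id": "VULN-12", "exposed": True,  "ease": "easy", "impact": "high"},
    {"id": "VULN-31", "exposed": False, "ease": "hard", "impact": "high"},
    {"id": "VULN-44", "exposed": True,  "ease": "hard", "impact": "low"},
]

def priority(vuln):
    # exposed before internal, easy to exploit before hard, high impact before low
    return (
        0 if vuln["exposed"] else 1,
        0 if vuln["ease"] == "easy" else 1,
        0 if vuln["impact"] == "high" else 1,
    )

for vuln in sorted(backlog, key=priority):
    print(vuln["id"])      # prints VULN-12, then VULN-44, then VULN-31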

Sometimes the action you need to take to get rid of a security hole can create other problems – like installing an update that is not compatible with your code. When this is the case, you may need to spend more resources on it than on a “normal” vulnerability, because you need to rewrite code – and that refactoring may also require you to update your threat model and risk mitigations.

Operations: your code on the battle field

In production your code is exposed to its users, and in part it may also be exposed to the internet as a whole. Dealing with feedback from this jungle should be seen as a key part of your vulnerability management program.

First of all, you will get access to logs and feedback from operations, whether it is performance related, bug detections or security incidents. It is important that you feed this into your issue management system and deal with it throughout sprints. Sometimes you may even have a critical situation requiring you to push a “hotfix” – a change to the code as fast as possible. The good thing about a good pipeline is that your hotfix will still go through basic security testing. Hopefully, your agile security process and your CI/CD pipeline are now working so well in symbiosis that they don’t slow your hotfix down. In other words: the “hotfix” you are pushing is just a code commit like all others – you are pushing to production several times a day, so how would this be any different?

Another aspect is feedback from incident response. There are two levels of incident response feedback that we should consider:

  1. Incident containment/eradication leading to hotfixes.
  2. Security improvements from the lessons learned stage of incident response

The first part we have already considered. The second part could be improvements to detections, better logging, etc. These should go into the product backlog and be handled during the normal sprints. Don’t let lessons learned end up as a PowerPoint given to a manager – a real lesson learned ends up as a change in your code, your environment, your documentation, or in the incident response procedures themselves.

Key takeaways

This was a long post, here are the key practices to take away from it!

  • Remember that vulnerabilities come from poor operational practices, flaws in design/architecture, and from bugs (implementation errors). Linting only helps with bugs.
  • Use threat modeling to identify operational and design weaknesses
  • All errors are human errors. A good working environment helps reduce vulnerabilities (see performance shaping factors).
  • Validate mitigations using unit tests and integration tests.
  • Test your code in your pipeline.
  • Pay off technical debt religiously.

Avoid keeping sensitive info in a code repo – how to remove files from git version history

One of the vulnerabilities that is really easy to exploit is when people leave super-sensitive information in source code – and you get your hands on this source code. In early prototyping a lot of people will hardcode passwords and certificate keys in their code, and remove them later when moving to production code. Sometimes they are not even removed from production. But even when you do remove them, this sensitive information can linger in your version history. What if your app is an open source app where you are sharing the code on GitHub? You probably don’t want to share your passwords…

Don’t let bad guys get the key to your databases and other valuable files by searching old versions of your code in the repository.

Getting this sensitive info out of your repository is not as easy as deleting the file from the repo and adding it to the .gitignore file – because this does not touch your version history. What you need to do is this:

  • Merge any remote changes into your local repo, to make sure you don’t remove the work of your team if they have committed after your own last merge/commit
  • Remove the file history for your sensitive files from your local repo using the filter-branch command:

git filter-branch --force --index-filter \
'git rm --cached --ignore-unmatch PATH-TO-YOUR-FILE-WITH-SENSITIVE-DATA' \
--prune-empty --tag-name-filter cat -- --all

Although the command above looks somewhat scary, it is not that hard to dig out – you can find it in the GitHub docs. When that’s done, there are only a few more things to do:

  • Add the files in question to your .gitignore file
  • Force push to the remote repo (git push origin --force --all)
  • Tell all your collaborators to clone the repo as a fresh start, to avoid them merging the sensitive files back in again

Also, if you have actually pushed sensitive info to a remote repository, particularly if it is an open source publicly available one, make sure you change all passwords and certificates that were included previously – this info should be considered compromised.


How do you update failure rates and test intervals based on limited data observations?

There is one post on this blog that consistently receives traffic from search engines; namely this post on the effect of uncertainty on PFD calculations in reliability engineering: https://safecontrols.wordpress.com/2015/07/21/uncertainty-and-effect-of-proof-test-intervals-on-failure-probabilities-of-critical-safety-functions/

It is interesting to see the effect on the dynamic probability of failure on demand from a theoretical perspective. Consider now instead the problem of collecting operational data and adjusting the test intervals to optimize uptime while keeping within the PFD constraints given by the SIL requirement. To do this in a robust manner, one must take the uncertainty in the data into account. We are seeking to solve this problem:
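In rough notation (my own), with τ the proof test interval, λ_DU the rate of dangerous undetected failures, and C the PFD limit implied by the SIL requirement:

\max_{\tau} \ \tau \quad \text{subject to} \quad \mathrm{UB}\!\left[\,\overline{\mathrm{PFD}}(\tau, \lambda_{DU})\,\right] \le C,
\qquad \text{with } \overline{\mathrm{PFD}} \approx \frac{\lambda_{DU}\,\tau}{2} \ \text{for a single component tested every } \tau .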

In other words: maximize the test interval while keeping the upper confidence bound on the average value of the PFD below the set value C, given that the standard deviation of the rate of dangerous undetected failures is known. To make things more practical, we consider a simple SIL loop where the PFD value is dominated by the final element, and we make the simplification, for the sake of the calculation, that a single component makes up the loop. Let us then assume we have 20 valves of the same type that have operated over an aggregated 400 000 hours, and that we have a theoretical failure rate of 10⁻⁶ per hour for these valves. We have not had any real demand trips, and the original test frequency was once per year. Testing has revealed that one valve had a dangerous failure in its first year of operation. Can we use this to extend the test interval without increasing the risk to our assets?

A naïve estimate of the failure rate based on our observations indicates a failure rate of 1.25 × 10⁻⁶, which differs from the a priori estimate based on the design data. However, the design data are based on a larger data set and should not be disregarded if we wish to be reasonably sure about our decisions. So, the expected mean time to failure would be somewhere between 114 years and 913 years – a significant difference. SINTEF has released a report that gives a simplified approach to updating the failure rate. This approach requires you to define a conservative estimate of the failure rate based on the a priori data – often chosen to be double the original failure rate: λDU,CE = 2·λDU. From the a priori rate and this conservative estimate, uncertainty parameters (the parameters of a Gamma prior distribution) are then calculated.

The combined (updated) failure rate estimate is then obtained by updating this prior with the observed data – x dangerous failures observed over an aggregate operational time t. Using this on our example gives a combined failure rate that is higher than the a priori design rate.
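Written out in generic form (my notation; the exact choice of the Gamma parameters α and β is defined in the SINTEF report), this is a conjugate update of the prior:

\lambda_{DU}^{\mathrm{comb}} = \frac{\alpha + x}{\beta + t},
\qquad \text{with prior mean } \frac{\alpha}{\beta} = \lambda_{DU} \text{ and the spread of the prior set by } \lambda_{DU,\mathrm{CE}} .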

What is going on here – the combined failure rate is higher than the a priori? The expected number of failures in 400 000 hours with an a priori MTTF of 1 million hours is clearly less than 1 – and we had one failure. So the estimate is sound. SINTEF’s methodology will give you lots more details, including credibility intervals for the Bayesian updates.

So – now to the test intervals. If the new combined failure rate is accepted, we should probably test more often, right? It depends; SINTEF argues that it is important to be conservative when updating test intervals, to make sure insufficient data do not lead us astray. They propose the following simple rule:

If the new failure rate is less than half of the original failure rate, and the upper 90% confidence bound on the new failure rate is lower than the a priori failure rate, the test interval can be doubled.

If the failure rate is more than double the original failure rate, and the lower 90% confidence bound on the new failure rate is higher than the a priori failure rate, the test interval can be halved (e.g. from one year to every 6 months).

This means that in our case – the test interval stays the way it is.
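As a small sketch of the rule in code (the thresholds are exactly as stated above; the numbers in the example call are illustrative, with confidence bounds assumed to be computed separately):

# test_interval_update.py – a sketch of the conservative doubling/halving rule described above.
def updated_test_interval(tau_hours, lam_prior, lam_new, lam_new_lower90, lam_new_upper90):
    """Return the new proof test interval according to the rule above."""
    if lam_new < 0.5 * lam_prior and lam_new_upper90 < lam_prior:
        return 2 * tau_hours            # strong evidence of better performance: test half as often
    if lam_new > 2.0 * lam_prior and lam_new_lower90 > lam_prior:
        return tau_hours / 2            # strong evidence of worse performance: test twice as often
    return tau_hours                    # otherwise, keep the interval unchanged

# The valve example: a prior of 1e-6 per hour and a combined estimate somewhat above it,
# with confidence bounds straddling the prior – the interval stays at one year (8760 hours).
print(updated_test_interval(8760, 1e-6, 1.4e-6, 3e-7, 4e-6))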

Thinking about risk through methods

Risk management is a topic with a large number of methods. Within the process industries, semi-quantitative methods are popular, in particular for determining the required SIL for safety instrumented functions (automatic shutdowns, etc.). Two common approaches are known as LOPA, which is short for “layers of protection analysis”, and Riskgraph. These methods are sometimes treated as “holy” by practitioners, but the truth is that they are merely cognitive aids in sorting through our thinking about risks.

 

Riskgraph #sliderule – methods are formalisms. See picture on Instagram

 

In short, our risk assessment process consists of a series of steps:

  • Identify risk scenarios
  • Find out what can reduce the risk that you have in place, like design features and procedures
  • Determine what the potential consequences of the scenario at hand are, e.g. worker fatalities or a major environmental disaster
  • Make an estimate of how likely or credible you think it is that the risk scenario should occur
  • Consider how much you trust the existing barriers to do the job
  • Determine how trustworthy your new barrier must be for the situation to be acceptable

Several of these bullet points can be very difficult tasks alone, and putting together a risk picture that allows you to make sane decisions is hard work. That’s why we lean on methods, to help us make sense of the mess that discussions about risk typically lead to.

Consequences can be hard to gauge, and one bad situation may lead to a set of different outcomes. Think about the risk of “falling asleep while driving a car”. Both of these are valid consequences that may occur:

  • You drive off the road and crash in the ditch – moderate to serious injuries
  • You steer the car into the wrong lane and crash head-on with a truck – instant death

Should you think about both, or pick one of them, or another consequence not on this list? In many “barrier design” cases the designer chooses to design for the worst-case credible consequence. It may be difficult to judge what is really credible, and what is truly the worst case. And is this approach sound if the worst case is credible but still quite unlikely, while at the same time you have relatively likely scenarios with less serious outcomes? If you use a method like LOPA or Riskgraph, you may very well have a statement in your method description to always use the worst-case consequence. A bit of judgment and common sense is still a good idea.

Another difficult topic is probability, or credibility. How likely is it that an initiating event should occur, and what is the initiating event in the first place? If you are the driver of the car, is “falling asleep behind the wheel” the initiating event? Let’s say it is. You can definitely find statistics on how often people fall asleep behind the wheel. The key question is: are those data applicable to the situation at hand? Are data from other countries applicable? Maybe not, if they have different road standards, different requirements for getting a driver’s license, etc. Personal or local factors can also influence the probability. In the case of the driver falling asleep, the probabilities would be influenced by his or her health, stress levels, maintenance of the car, etc. The bottom line is that the estimate of probability will also be a judgment call in most cases. If you are lucky enough to have statistical data to lean on, make sure you validate that the data are representative of your situation. Good method descriptions should also give guidance on how to make these judgment calls.

Most risks you identify already have some risk reducing barrier elements. These can be things like alarms and operating procedures, and other means to reduce the likelihood or consequence of escalation of the scenario. Determining how much you are willing to rely on these other barriers is key to setting a requirement on your safety function of interest – typically a SIL rating. Standards limit how much you can trust certain types of safeguards, but also here there will be some judgment involved. Key questions are:

  • Are multiple safeguards really independent, such that the same type of failure cannot knock out multiple defenses at once?
  • How much trust can you put in each safeguard?
  • Are there situations where the safeguards are less trustworthy, e.g. if there are only summer interns available to handle a serious situation that requires experience and leadership?

Risk assessment methods are helpful, but don’t forget that you make a lot of assumptions when you use them. Don’t forget to question your assumptions even when you use a recognized method – especially if somebody’s life will depend on your decision.

How do digital control systems get developed?

Digital control systems control almost every piece of technology we use, from the thermostat in your fridge to oil refineries and self-driving cars. My answer to this Quora user’s question suggests an iterative process involving:

  • setting objectives and goals
  • modeling the plant
  • designing the control structure
  • testing and simulation studies
  • testing on the real plant
  • maintenance during operations

The important thing here is not to think of it as a linear workflow; you will jump back in the process and redo stuff several times. For some unknown reason, universities tend to focus only on the modeling and simulation study part. The world would be a more reliable place if designers were taught to think about the whole process and the whole lifecycle from the start instead of waiting for experience to sink in.
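As a toy illustration of the modeling, control design and simulation steps (a first-order plant and a proportional controller; every number here is invented for the example):

# toy_control_loop.py – a deliberately simple plant model and controller, for illustration only.
dt, n_steps = 0.1, 200                 # simulation time step [s] and number of steps
setpoint = 1.0                         # the objective: drive the output y towards 1.0
k_plant, tau_plant = 2.0, 5.0          # assumed first-order plant: tau*dy/dt = -y + k*u
kp = 1.5                               # proportional gain, tuned by trial in simulation

y = 0.0
for step in range(n_steps):
    error = setpoint - y
    u = kp * error                                   # control structure: proportional feedback
    y += dt * (-y + k_plant * u) / tau_plant         # simulate the plant one step (Euler integration)
    if step % 50 == 0:
        print(f"t = {step * dt:4.1f} s   y = {y:.3f}")

Even this toy loop shows why the iteration matters: a pure proportional controller leaves a steady-state offset, which sends you straight back to the control design step.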

Updating failure rates based on operational data – are we fooling ourselves again?

Failure rates for critical components are difficult to trust. Basically, if we look at public sources for data, such as the OREDA handbook, we observe that typical components have very wide confidence intervals for estimated failure rates, in spite of 30 years of collecting these data. If we look at the data supplied by vendors, they simply avoid saying anything about the spread or uncertainty in their data. Common practice today is to measure SIL compliance based on vendor supplied data, after a sanity check by the analyst. The sanity check usually consists of comparing with other data sources, and basically looking for completely ridiculous reliability claims. When coming to the operational phase it then becomes interesting to compare actual performance with the promised performance of the vendor. Typically, the actual performance is 10 to 100 times worse than promised. Because these components provide important parts of the barriers against terrible accidents, operators are also interested in measuring the actual integrity of these barriers. One possible method for doing this is given in the SINTEF report A8788, called “Guidelines for follow-up of safety instrumented systems (SIS) in operations”. For details, you should read that report but here’s the basics of what the report recommends for updating failure rates:

  • If you have more than 3 million operating hours for a particular type of item, you can calculate the expected failure rate as “failures observed divided by number of hours of operation” and give a confidence interval based on the chi-square distribution (a commonly used form is sketched right after this list)
  • If you have less than 3 million operating hours but you have observed dangerous undetected failures (most likely during testing), you can combine the a priori design failure rate with the operational experience.
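For reference, the commonly used chi-square form for the estimate and a two-sided 90% confidence interval on the rate, given x failures observed over an aggregate operating time t, is:

\hat{\lambda} = \frac{x}{t}, \qquad
\lambda_{\mathrm{lower}} = \frac{\chi^2_{0.05}(2x)}{2t}, \qquad
\lambda_{\mathrm{upper}} = \frac{\chi^2_{0.95}(2x + 2)}{2t},

where \chi^2_{p}(\nu) denotes the p-quantile of the chi-square distribution with \nu degrees of freedom.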

Let’s look at how to combine failure rates. First, you have to give some conservative estimate of the “true failure rate” as a basis for the combination; this is used to say something about the uncertainty in the original estimate. From OREDA one can observe that the upper 90% confidence bound is often around 2 times the failure rate value, so if no better estimate is available, use this. For very low reliability claims, use 5 × 10⁻⁷ (lower than that seems too optimistic in most cases). From the design failure rate and the conservative estimate, two uncertainty parameters (the parameters of a Gamma prior) are then calculated.

Here λDU,CE is the conservative estimate and λDU is the design failure rate. The combined failure rate can then be estimated by updating this prior with the observed number of dangerous undetected failures, where the aggregate operational time is the number of similar items multiplied by the number of operational hours per item.

The SINTEF method does not give formulas for a confidence bound for the combined rate, but we may assume this will be between zero and the conservative estimate (which does not tell us too much, really). For the rate based on pure operational data we can use the standard formulas. Consider now a case with about 75 transmitters with a design failure rate of 5 × 10⁻⁷ failures per hour. Over a 30-year simulated operational period we would expect approximately 10 failures. Injecting 10 failures at random intervals yields interesting results in a simulated case.

Note that up to 3 million operational hours we have assumed that the design rate (PDS value) governs the uncertainty. Note also that for infrequent failures, the confidence bands and the estimated failure rate are heavily influenced by each individual failure observation. We should thus be very careful with updating operational practices directly based on a few failure observations.
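A sketch of such a simulation (using numpy and scipy; the random seed and reporting interval are arbitrary), tracking the naive estimate and a 90% chi-square confidence band as operating hours accumulate:

# failure_rate_simulation.py – illustrative sketch of the simulated case described above.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
n_items, lam_design, years = 75, 5e-7, 30
total_hours = n_items * years * 8760                     # aggregate operating hours
n_failures = rng.poisson(lam_design * total_hours)       # roughly 10 failures expected
failure_times = np.sort(rng.uniform(0, total_hours, n_failures))

for t in np.linspace(total_hours / 10, total_hours, 10):
    x = int((failure_times <= t).sum())                  # failures observed so far
    estimate = x / t
    lower = chi2.ppf(0.05, 2 * x) / (2 * t) if x > 0 else 0.0
    upper = chi2.ppf(0.95, 2 * x + 2) / (2 * t)
    print(f"{t / 1e6:5.2f}M hours: {x:2d} failures, "
          f"estimate {estimate:.2e}, 90% band [{lower:.2e}, {upper:.2e}]")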

Solving the fragmentation problem in documentation of reliability

Reliability standards require that components used as parts of a safety function or a safety instrumented system are documented to show full compliance with the reliability requirements. In practice, however, documentation is often severely lacking. In essence, the documentation required for a given component would include:

  • A description of how the component will be used in the safety function, the SIS, and which barrier functions it will support
  • A description of failure rate data and calculations to show that performance is satisfactory as measured against PFD or PFH requirements for the given SIL requirement on the function the component is a part of.
  • A description of systematic capability and under which architectures the component can be used for a given SIL requirement
  • A description of software assurance to satisfy relevant requirements
  • A description of quality management and how the vendor works to avoid systematic failures

Collecting pieces of fragmented reliability information can be a tedious and painful exercise – however, not using available information may be worse for the project as a whole than accepting that things are not going to be perfect.

In many cases one or more pieces of this documentation is missing. However, the same component can be part of many deliverables; for example, a pressure transmitter may be part of various packages delivered by multiple package vendors. In some cases, these vendors deliver bits and pieces of relevant reliability documentation, that by chance would cover all of the relevant aspects. In this case, there is enough proof that the component can perform its function as part of the SIS, provided all relevant configurations are covered. In such cases, should we allow such fragmented documentation?

The principled answer would be “NO”. One reason for this is that traceability from requirement to tag number to vendor deliverable and vendor documentation will be lost. In practice, however, we are not left with much choice. If the component is acceptable to use, we should of course use it. Traceability is, however, important in reliability projects. The system integrator should thus make a summary of the documentation, with pointers to where each piece of documentation comes from. This solves the traceability problem. However, we should also take care to educate the entire value chain on the needed documentation, to ensure sufficient traceability and to allow for assurance and verification activities without resorting to hunting for bits and pieces of fragmented information about each component. We should therefore put equal weight on:

  1. Ensuring our components are of sufficient quality and proven reliability for use in the SIS
  2. Influencing our value chain to focus on continuous improvement and correct documentation in projects

Why functional safety audits are useful

Functional safety work usually involves a lot of people, and multiple organizations. One key success factor for design and operation of safety instrumented systems is the competence of the people involved in the safety lifecycle. In practice, when activities have been omitted, or the quality of the work is not acceptable, this is discovered in the functional safety assessment towards the end of the project, or worse, it is not discovered at all. The result of too low quality is lower integrity of the SIS as a barrier, and thus higher risk to people, assets and environment – without the asset owner being aware of this! Obviously a bad situation.

Knowing how to do your job is important in all phases of the safety lifecycle. Functional safety audits can be an important tool for verification – and for motivating organizations to maintain good competence management systems for all relevant roles.

Competence management is a key part of functional safety management. In spite of this, many companies have less than desirable track records in this field. This may be due to ignorance, or maybe because some organizations view a «SIL» as a marketing designation rather than a real risk reduction measure. Either way – such a situation is unacceptable. One key tool for ensuring everybody involved understands what their responsibilities are, and makes an effort to learn what they need to know to actually secure the necessary system level integrity, is the use of functional safety audits. An auditing program should be available in all functional safety projects, with at least the following aspects:

  • A procedure for functional safety audits should exist
  • An internal auditing program should exist within each company involved in the safety lifecycle
  • Vendor auditing should be used to make sure suppliers are complying with functional safety requirements
  • All auditing programs should include aspects related to document control, management of change and competence management

Constructive auditing can be an invaluable part of building a positive organizational culture – one where quality becomes important to every function involved in the value chain, from the sales rep to the R&D engineer.

One day statements like “please take the chapter on competence out of the management plan, we don’t want any difficult questions about systems we do not have” may seem like an impossible absurdity.

How does your approach to maintenance change the reliability of your system?

Everybody knows that you need to maintain your stuff to make sure it is in top condition; a colloquial way of saying that maintenance is a necessary component of any highly reliable system. In practice, there are three quite common approaches to maintenance of technical systems:

  • If it ain’t broke don’t fix it
  • Periodic inspection and maintenance
  • Risk based maintenance

The first policy is the most commonly applied in private life. The only reason this is at all possible is a combination of high labor cost, relatively cheap mass production of many things, and that insurance companies allow you to do it by paying for stuff that “accidentally” breaks. For reliability, this policy is clearly the least reasonable, and it will negatively affect the reliability of your system. The following picture shows an example of the degradation that may occur when following this policy – taken from a walk in the Danish countryside.

The second policy is maybe the most commonly applied in industry – and for conscious car owners. You take your car to the mechanic once per year, or perhaps every 25000 km, for oil change and a check-up. You test the most important things, and follow up if necessary. This ensures your reliability is not negatively affected by lack of maintenance – but still the work has to be performed with high quality. Maintenance done right reduces probability of severe failures, done wrong it is outright dangerous.

Risk based maintenance uses available data to judge the risk level for your equipment – by using a combination of information gained from inspections and direct measurements. When you actively monitor the risk of failures and plan maintenance activities in accordance with real risks, you can prolong the useful life of your system while maintaining the necessary reliability.

When assessing the reliability of critical systems you should thus include both the philosophy of maintenance and the ability of the organization to execute the maintenance plan in a good way in your assessment. A simple illustration of this is shown in the below picture.

Follow-up of the supply chain for SIL rated equipment

Procurement is easy – getting exactly what you need is not. I have previously discussed the challenges related to follow-up of suppliers of SIL rated equipment on this blog, but that was from the perspective of an organization. This time, let’s look at what this means for you, if you are either the purchaser or the package engineer responsible for the deliverable. Basically there are three challenges related to communication in procurement of SIL rated equipment – or procurement of anything, for that matter:

  • The purchaser does not understand what the project needs
  • The supplier does not understand what the purchaser needs
  • The package engineer does not know that the purchaser does not know what the project needs, and therefore he or she does also not know that the supplier does not know what the project actually needs

This, of course, is a recipe for a lot of quarreling and time wasted on finger pointing and the blame game. All of this is expensive, frustrating and useless. What can we do to avoid this problem in the first place? First, everybody needs to know a few basic things about SIL. The standards used in industry are quite heavy reading, and when guidelines for your industry are available, it is a good idea to use them. For the oil and gas industry, the Norwegian Oil & Gas Association’s Guideline No. 070 is a very good starting point. To distill it down to a bare minimum, the following concepts should be known to all purchasers and package engineers:

  • Why does a safety integrity level requirement exist for the function your equipment is a part of?
  • What is a safety integrity level (SIL) in terms of:
    • Quantitative requirements (PFD quota for the equipment)
    • Architectural requirements (hardware fault tolerance, safe failure fraction, etc.)
    • Software requirements
    • Qualitative requirements
  • What are the basic documentation requirements?

When this is known, communication between purchaser and supplier becomes much easier. It also becomes easier for the package engineer and the purchaser to discuss follow-up of the vendors and what requirements should be put in the purchase order, as well as in the request for proposal. Most projects will develop a lot of functional safety documents. Two of the most important ones in the purchasing process are:

  • Safety Requirement Specification (SRS): in this document you find a description of the function your component is a part of, and the SIL requirements for that function. You will also find the PFD quotas allocated to each component in the function – an important number to use in the purchasing process.
  • A “Vendor guideline for Safety Analysis Reports” or a “Safety Manual Guideline” describing the project’s documentation requirements for SIL rated equipment

So, what can you do to bring things into this nice and orderly state? If you are a purchaser, take a brief SIL primer, or preferably, ask your company’s functional safety person to give you a quick introduction. Then talk to your package engineer about these things when setting out the ITT. If you are a package engineer, invite your purchaser for a coffee to discuss the needs of the project in terms of these things. If the purchaser does not understand the terminology, be patient and explain. And remember that not everybody has the right background: the engineer may fail to understand some technical details of the purchasing function, and the purchaser may not understand the inner workings of your compressor – but aiming for a common platform to discuss requirements and follow-up of vendors will make life easier for both of you.