One of the vulnerabilities that is really easy to exploit is when people leave super-sensitive information in source code – and you get your hands on this source code. In early prototyping a lot of people will hardcode passwords and certificate keys in their code, and remove them later when moving to production code. Sometimes they are not even removed from production. But even in the case where you do remove them, this sensitive information can linger in your version history. What if your app is an open source app where you are sharing the code on GitHub? You probably don’t want to share your passwords…
Getting this sensitive info out of your repository is not as easy as deleting the file from the repo and adding it to the .gitignore file – because this does not touch your version history. What you need to do is this:
Merge any remote changes into your local repo, to make sure you don’t remove the work of your team if they have committed after your own last merge/commit
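Remove the file history for your sensitive files from your local repo using the filter-branch command, with PATH-TO-YOUR-FILE-WITH-SENSITIVE-DATA replaced by the path to your file:
git filter-branch --force --index-filter "git rm --cached --ignore-unmatch PATH-TO-YOUR-FILE-WITH-SENSITIVE-DATA" --prune-empty --tag-name-filter cat -- --all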
Although the command above looks somewhat scary it is not that hard to dig out – you can find it in the GitHub docs. When that’s done, there are only a few more things to do:
Add the files in question to your .gitignore file
Force write to the repo (git push origin --force --all)
Tell all your collaborators to clone the repo as a fresh start to avoid them merging in the sensitive files again
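Once the history has been rewritten, it can be worth checking that no commit still touches the sensitive file. The following is a minimal sketch, assuming git is on the path and the script runs against your local clone; the function names are made up for illustration, and the file path you pass in is your own.

```python
import subprocess


def history_check_command(path):
    """Build the git command that lists every commit touching `path`."""
    return ["git", "log", "--all", "--pretty=format:%H", "--", path]


def commits_touching(path, repo_dir="."):
    """Return the hashes of commits, on any ref, that still touch `path`."""
    result = subprocess.run(
        history_check_command(path),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if git itself fails (e.g. not a repository)
    )
    return [line for line in result.stdout.splitlines() if line]


# Example use (the path is a placeholder for your own sensitive file):
#   leftovers = commits_touching("PATH-TO-YOUR-FILE-WITH-SENSITIVE-DATA")
#   print("clean" if not leftovers else f"{len(leftovers)} commits still contain it")
```

An empty list means no commit reachable from any ref still contains the file. Note that `--all` also covers the backup refs that filter-branch keeps under refs/original, so a non-empty result right after the rewrite may just reflect those backups.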
Also, if you have actually pushed sensitive info to a remote repository, particularly if it is an open source publicly available one, make sure you change all passwords and certificates that were included previously – this info should be considered compromised.
It is interesting to see the effect on the dynamic probability of failure on demand from a theoretical perspective. Consider now instead the problem of collecting operational data and adjusting the test intervals to optimize uptime while keeping within the PFD constraints given by the SIL requirement. To do this in a robust manner, one must take the uncertainty in the data into account. We are seeking to solve this problem:
In other words: maximize the test interval while keeping the upper confidence bound on the average value of the PFD below the set value C, given that the standard deviation of the rate of dangerous undetected failures is known. To make things more practical, we consider a simple SIL loop where the PFD value is dominated by the final element. We make the simplification, for the sake of the calculation, that a single component is the loop. Let us then assume we have 20 valves of the same type that have operated over an aggregated 400 000 hours, and we have a theoretical failure rate of 10⁻⁶ per hour for these valves. We have not had any real demand trips, and the original test frequency was once per year. Testing has revealed that one valve had a dangerous failure in its first year of operation. Can we use this to extend the test interval without increasing the risk to our assets?
A naïve estimate of the failure rate based on our observations indicates a failure rate of 1.25 × 10⁻⁷ per hour, which is obviously better than the a priori estimate from the design data. However, the design data is based on a larger data set and should not be disregarded if we wish to be reasonably sure about our decisions. So, the expected mean time to failure would be somewhere between 114 years and 913 years – a significant difference. SINTEF has released a report that gives a simplified approach to updating the failure rate. This approach requires you to define a conservative estimate of the failure rate based on the a priori data – often chosen as double the original failure rate: λDU_CE = 2 λDU. Uncertainty parameters are then calculated based on the Gamma distribution as
Then the combined (updated) failure rate estimate is given as
where n is the number of dangerous failures observed, and t is the aggregate operational time. Using this on our example gives us
What is going on here – the combined failure rate is higher than the a priori? The expected number of failures in 400 000 hours with an a priori MTTF of 1 million hours is clearly less than 1 – and we had one failure. So the estimate is sound. SINTEF’s methodology will give you lots more details, including credibility intervals for the Bayesian updates.
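The update can be sketched in a few lines of Python. This is a sketch under stated assumptions, not a transcription of the SINTEF report: it assumes the standard gamma-conjugate form, where the prior has the design failure rate as its mean and the gap between the conservative estimate and the design rate as its standard deviation. The function and variable names are made up for illustration.

```python
def combined_failure_rate(design_rate, n_failures, hours, conservative_rate=None):
    """Bayesian (gamma-Poisson) update of a failure rate, per hour.

    Assumption: the prior is a gamma distribution with mean `design_rate`
    and standard deviation `conservative_rate - design_rate`.
    """
    if conservative_rate is None:
        conservative_rate = 2.0 * design_rate  # the "double" rule of thumb
    prior_sd = conservative_rate - design_rate
    prior_hours = design_rate / prior_sd ** 2   # gamma rate parameter
    prior_failures = prior_hours * design_rate  # gamma shape parameter
    # Posterior mean: pseudo-failures plus observed failures,
    # divided by pseudo-hours plus observed hours.
    return (prior_failures + n_failures) / (prior_hours + hours)


# Valve example: design rate 1e-6 per hour, one dangerous failure,
# 400 000 aggregated operating hours.
rate = combined_failure_rate(1e-6, n_failures=1, hours=400_000)
print(f"combined failure rate: {rate:.2e} per hour")
```

With these inputs the prior behaves like one failure seen over one million hours, so the combined estimate works out to 2 failures over 1.4 million hours, or about 1.43 × 10⁻⁶ per hour, which is above the a priori value.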
So, now to the test intervals: if the new combined failure rate is accepted, we should probably test more often, right? It depends. SINTEF argues that it is important to be conservative when updating test intervals, to make sure insufficient data do not lead us astray. They propose the following simple rule:
If the new failure rate is less than half of the original failure rate, and the upper 90% confidence bound on the new failure rate is lower than the a priori failure rate, the test interval can be doubled.
If the failure rate is more than double the original failure rate, and the lower 90% confidence bound on the new failure rate is higher than the a priori failure rate, the test interval can be halved (e.g. from one year to every 6 months).
This means that in our case – the test interval stays the way it is.
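The rule is easy to express as a small decision function. This is a sketch of the rule as described here, with made-up function and parameter names; the confidence bounds have to come from your own analysis.

```python
def updated_test_interval(interval, prior_rate, new_rate, lower90, upper90):
    """Apply the conservative doubling/halving rule to a test interval.

    `lower90` and `upper90` are the 90% confidence bounds on `new_rate`.
    The interval is returned in whatever unit it was given in.
    """
    if new_rate < 0.5 * prior_rate and upper90 < prior_rate:
        return interval * 2   # clearly better than assumed: test half as often
    if new_rate > 2.0 * prior_rate and lower90 > prior_rate:
        return interval / 2   # clearly worse than assumed: test twice as often
    return interval           # evidence not strong enough: keep as is


# Valve example: the combined rate is roughly 1.4 times the a priori rate,
# so neither condition can trigger whatever the bounds are. The bounds
# below are placeholders only, not calculated values.
print(updated_test_interval(12, 1e-6, 1.43e-6, lower90=0.0, upper90=1.0))
```

Both conditions must hold before the interval changes, which is what keeps a single observed failure from driving the decision.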
Risk management is a topic with a large number of methods. Within the process industries, semi-quantitative methods are popular, in particular for determining required SIL for safety instrumented functions (automatic shutdowns, etc.). Two common approaches are known as LOPA, which is short for “layers of protection analysis”, and RiskGraph. These methods are sometimes treated as “holy” by practitioners, but the truth is that they are merely cognitive aids in sorting through our thinking about risks.
In short, our risk assessment process consists of the following steps:
Identify risk scenarios
Find out what can reduce the risk that you have in place, like design features and procedures
Determine what the potential consequences of the scenario at hand are, e.g. worker fatalities or a major environmental disaster
Make an estimate of how likely or credible you think it is that the risk scenario should occur
Consider how much you trust the existing barriers to do the job
Determine how trustworthy your new barrier must be for the situation to be acceptable
Several of these bullet points can be very difficult tasks alone, and putting together a risk picture that allows you to make sane decisions is hard work. That’s why we lean on methods, to help us make sense of the mess that discussions about risk typically lead to.
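To make the last two steps concrete, here is a rough sketch of the arithmetic behind a LOPA-style calculation. It is not a method description: it assumes independent barriers whose probabilities of failure simply multiply, uses the low-demand SIL bands from IEC 61508, and all numbers in the example are invented for illustration.

```python
def required_pfd(initiating_frequency, existing_barrier_pfds, tolerable_frequency):
    """PFD the new barrier must achieve for the scenario to be acceptable.

    Frequencies are per year. Existing barriers are assumed independent,
    so their probabilities of failure on demand multiply.
    """
    mitigated = initiating_frequency
    for pfd in existing_barrier_pfds:
        mitigated *= pfd
    return min(1.0, tolerable_frequency / mitigated)


def sil_for_pfd(pfd):
    """Map a required PFD to a SIL band (low-demand mode); 0 means no SIL."""
    if pfd >= 1e-1:
        return 0
    if pfd >= 1e-2:
        return 1
    if pfd >= 1e-3:
        return 2
    if pfd >= 1e-4:
        return 3
    return 4  # SIL 4 band or tighter; such targets usually call for a redesign


# Invented example: initiating event once per 10 years, one existing
# barrier trusted to fail at most 1 time in 10, and a tolerable
# frequency of 5e-5 per year for the consequence in question.
target = required_pfd(0.1, [0.1], 5e-5)
print(f"required PFD: {target:.1e}, SIL {sil_for_pfd(target)}")
```

Every input here is one of the judgment calls discussed in this post, so the output is only as good as the assumptions behind it.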
Consequences can be hard to gauge, and one bad situation may lead to a set of different outcomes. Think about the risk of “falling asleep while driving a car”. Both of these are valid consequences that may occur:
You drive off the road and crash in the ditch – moderate to serious injuries
You steer the car into the wrong lane and crash head-on with a truck – instant death
Should you think about both, or pick one of them, or another consequence not on this list? In many “barrier design” cases the designer chooses to design for the worst-case credible consequence. It may be difficult to judge what is really credible, and what is truly the worst-case. And is this approach sound if the worst-case is credible but still quite unlikely, while at the same time you have relatively likely scenarios with less serious outcomes? If you use a method like LOPA or RiskGraph, you may very well have a statement in your method description to always use the worst-case consequence. A bit of judgment and common sense is still a good idea.
Another difficult topic is probability, or credibility. How likely is it that an initiating event should occur, and what is the initiating event in the first place? If you are the driver of the car, is “falling asleep behind the wheel” the initiating event? Let’s say it is. You can definitely find statistics on how often people fall asleep behind the wheel. The key question is, is this applicable to the situation at hand? Are data from other countries applicable? Maybe not, if they have different road standards, different requirements for getting a driver’s license, etc. Personal or local factors can also influence the probability. In the case of the driver falling asleep, the probabilities would be influenced by his or her health, stress levels, maintenance of the car, etc. The bottom line is that the estimate of probability will also be a judgment call in most cases. If you are lucky enough to have statistical data to lean on, make sure you validate that the data are representative for your situation. Good method descriptions should also give guidance on how to do these judgment calls.
Most risks you identify already have some risk reducing barrier elements. These can be things like alarms and operating procedures, and other means to reduce the likelihood or consequence of escalation of the scenario. Determining how much you are willing to rely on these other barriers is key to setting a requirement on your safety function of interest – typically a SIL rating. Standards limit how much you can trust certain types of safeguards, but also here there will be some judgment involved. Key questions are:
Are multiple safeguards really independent, such that the same type of failure cannot knock out multiple defenses at once?
How much trust can you put in each safeguard?
Are there situations where the safeguards are less trustworthy, e.g. if there are only summer interns available to handle a serious situation that requires experience and leadership?
Risk assessment methods are helpful, but don’t forget that you make a lot of assumptions when you use them. Question your assumptions even if you use a recognized method, especially if somebody’s life will depend on your decision.
Digital control systems control almost every piece of technology we use, from the thermostat in your fridge to oil refineries and self-driving cars. My answer to this Quora user’s question suggests an iterative process involving:
setting objectives and goals
modeling the plant
designing the control structure
testing and simulation studies
testing on the real plant
maintenance during operations
The important thing here is not to think of it as a linear workflow; you will jump back in the process and redo stuff several times. For some unknown reason, universities tend to focus only on the modeling and simulation study part. The world would be a more reliable place if designers were taught to think about the whole process and the whole lifecycle from the start instead of waiting for experience to sink in.
Failure rates for critical components are difficult to trust. Basically, if we look at public sources for data, such as the OREDA handbook, we observe that typical components have very wide confidence intervals for estimated failure rates, in spite of 30 years of collecting these data. If we look at the data supplied by vendors, they simply avoid saying anything about the spread or uncertainty in their data. Common practice today is to measure SIL compliance based on vendor supplied data, after a sanity check by the analyst. The sanity check usually consists of comparing with other data sources, and basically looking for completely ridiculous reliability claims. When coming to the operational phase it then becomes interesting to compare actual performance with the promised performance of the vendor. Typically, the actual performance is 10 to 100 times worse than promised. Because these components provide important parts of the barriers against terrible accidents, operators are also interested in measuring the actual integrity of these barriers. One possible method for doing this is given in the SINTEF report A8788, called “Guidelines for follow-up of safety instrumented systems (SIS) in operations”. For details, you should read that report but here’s the basics of what the report recommends for updating failure rates:
If you have more than 3 million operating hours for a particular type of item, you can calculate the expected failure rate as “failures observed divided by number of hours of operation” and give a confidence interval based on the chi-square distribution.
If you have less than 3 million operating hours but you have observed dangerous undetected failures (most likely during testing), you can combine the a priori design failure rate with operational experience.
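For the first case, the estimate and its confidence interval can be sketched with the Python standard library alone. The sketch below does not call a chi-square function directly; it uses the equivalent Poisson formulation (the chi-square quantiles with 2x and 2x + 2 degrees of freedom, divided by 2, are the Poisson means solved for here), found by bisection. Function names are made up for illustration.

```python
import math


def poisson_cdf(k, mean):
    """P(N <= k) for N ~ Poisson(mean)."""
    if k < 0:
        return 0.0
    term = math.exp(-mean)
    total = term
    for i in range(1, k + 1):
        term *= mean / i
        total += term
    return total


def solve_mean(k, target):
    """Find the Poisson mean m with P(N <= k) = target, by bisection."""
    lo, hi = 0.0, 1.0
    while poisson_cdf(k, hi) > target:  # the CDF falls as the mean grows
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(k, mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def failure_rate_interval(failures, hours, confidence=0.90):
    """Return (lower, estimate, upper) for the failure rate per hour."""
    tail = (1.0 - confidence) / 2.0
    estimate = failures / hours
    lower = 0.0 if failures == 0 else solve_mean(failures - 1, 1.0 - tail) / hours
    upper = solve_mean(failures, tail) / hours
    return lower, estimate, upper


# Example: 10 failures over 75 items x 30 years x 8760 hours.
low, est, high = failure_rate_interval(10, 75 * 30 * 8760)
print(f"{low:.2e} <= {est:.2e} <= {high:.2e} per hour")
```

With few failures the interval is wide relative to the point estimate, which is the reason for the 3 million hour threshold.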
Let’s look at how to combine failure rates: first, you have to give some conservative estimate of the “true failure rate” as a basis for the combination. This is used to say something about the uncertainty in the original estimate. From OREDA one can observe that the upper 90% confidence bound is often around 2 times the failure rate value, so if no better estimate is available, use this. For very low failure rate claims, use 5 × 10⁻⁷ (lower than that seems too optimistic in most cases). Then calculate the following parameters:
where λDU-CE is the conservative estimate and λDU is the design failure rate. Then, the combined failure rate can be estimated as
where n is the number of similar items and t is the number of operational hours.
The SINTEF method does not give formulas for a confidence bound for the combined rate, but we may assume this will be between zero and the conservative estimate (which does not tell us too much, really). For the rate based on pure operational data we can use standard formulas for this. Consider now a case with about 75 transmitters with a design failure rate of 5 × 10⁻⁷ failures per hour. Over a 30-year simulated operational period we would expect approximately 10 failures. Injecting 10 failures at random intervals yields interesting results in a simulated case:
Note that up to 3 million operational hours we have assumed the design rate (PDS value) is governing uncertainty. Note also that for infrequent failures, the confidence bands and the estimated failure rate are heavily influenced by each failure observation. We should thus be very careful with updating operational practices directly based on a few failure observations.
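A simulation along these lines can be sketched as follows. This is not the code behind the case above: the failure times are drawn uniformly over the aggregated operating hours, the estimate below 3 million hours uses a gamma-conjugate combination with a conservative estimate of twice the design rate, and all names and the random seed are choices made for this sketch.

```python
import random

HOURS_PER_YEAR = 8760


def simulate_follow_up(n_items=75, design_rate=5e-7, years=30,
                       n_failures=10, seed=1):
    """Return a list of (year, failures so far, estimated rate per hour)."""
    rng = random.Random(seed)
    total_hours = n_items * years * HOURS_PER_YEAR
    # Assumption: failures land uniformly over the aggregated hours.
    failure_times = sorted(rng.uniform(0, total_hours) for _ in range(n_failures))

    conservative = 2 * design_rate
    prior_hours = design_rate / (conservative - design_rate) ** 2
    prior_failures = prior_hours * design_rate

    history = []
    for year in range(1, years + 1):
        hours = n_items * year * HOURS_PER_YEAR
        observed = sum(1 for t in failure_times if t <= hours)
        if hours < 3_000_000:
            # Too little data: combine design rate and observations.
            rate = (prior_failures + observed) / (prior_hours + hours)
        else:
            # Enough data: use operational experience alone.
            rate = observed / hours
        history.append((year, observed, rate))
    return history


for year, observed, rate in simulate_follow_up():
    print(f"year {year:2d}: {observed:2d} failures, {rate:.2e} per hour")
```

Running it with different seeds shows how much a single early or late failure moves the yearly estimate.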
Reliability standards require that suppliers of components that will be used as parts in a safety function or a safety instrumented system shall be documented to show full compliance with the reliability requirements. In practice, however, documentation is often severely lacking. In essence, the documentation required for a given component would include:
A description of how the component will be used in the safety function, the SIS, and which barrier functions it will support
A description of failure rate data and calculations to show that performance is satisfactory as measured against PFD or PFH requirements for the given SIL requirement on the function the component is a part of.
A description of systematic capability and under which architectures the component can be used for a given SIL requirement
A description of software assurance to satisfy relevant requirements
A description of quality management and how the vendor works to avoid systematic failures
In many cases one or more pieces of this documentation are missing. However, the same component can be part of many deliverables; for example, a pressure transmitter may be part of various packages delivered by multiple package vendors. In some cases, these vendors deliver bits and pieces of relevant reliability documentation that by chance cover all of the relevant aspects. In this case, there is enough proof that the component can perform its function as part of the SIS, provided all relevant configurations are covered. Should we allow such fragmented documentation?
In principle, the answer would be “NO”. One reason for this is that traceability from requirement to tag number to vendor deliverable and vendor documentation will be lost. In practice, however, we are not left with much choice. If the component is acceptable to use, we should of course use it. Traceability is, however, important in reliability projects. The system integrator should thus make a summary of the documentation with pointers to where each piece of documentation is coming from. This solves the traceability problem. However, we should also take care to educate the entire value chain on the needed documentation, to ensure sufficient traceability and to allow for assurance and verification activities without resorting to hunting for bits and pieces of fragmented information about each component. We should therefore put equal weight into:
Ensuring our components are of sufficient quality and proven reliability for use in the SIS
Influencing our value chain to focus on continuous improvement and correct documentation in projects
Functional safety work usually involves a lot of people, and multiple organizations. One key success factor for design and operation of safety instrumented systems is the competence of the people involved in the safety lifecycle. In practice, when activities have been omitted, or the quality of the work is not acceptable, this is discovered in the functional safety assessment towards the end of the project, or worse, it is not discovered at all. The result of too low quality is lower integrity of the SIS as a barrier, and thus higher risk to people, assets and environment – without the asset owner being aware of this! Obviously a bad situation.
Competence management is a key part of functional safety management. In spite of this, many companies have less than desirable track records in this field. This may be due to ignorance, or maybe because some organizations view a «SIL» as a marketing designation rather than a real risk reduction measure. Either way – such a situation is unacceptable. One key tool for ensuring everybody involved understands what their responsibilities are, and makes an effort to learn what they need to know to actually secure the necessary system level integrity, is the use of functional safety audits. An auditing program should be available in all functional safety projects, with at least the following aspects:
A procedure for functional safety audits should exist
An internal auditing program should exist within each company involved in the safety lifecycle
Vendor auditing should be used to make sure suppliers are complying with functional safety requirements
All auditing programs should include aspects related to document control, management of change and competence management
Constructive auditing can be an invaluable part of building a positive organizational culture – where quality becomes important to every function involved in the value chain, from the sales rep to the R&D engineer.
One day statements like “please take the chapter on competence out of the management plan, we don’t want any difficult questions about systems we do not have” may seem like an impossible absurdity.