Tip of the iceberg: politicians’ private email accounts and shadow IT

In CISO circles the term “shadow IT” is commonly used to describe employees using private accounts, devices and networks to conduct work outside the company’s IT policies. People often do this because they feel they don’t have the freedom to get the job done within the rules.

If you deny your people a well-stacked toolbox, they will bring their own. That may not be the best solution for your security. 

This is not only for low-level clerks and helpdesk ninjas: top-level managers are known to do this a lot, including politicians. Hillary Clinton probably lost the presidential election at least partially due to her poor security awareness. Now VP Mike Pence has also been outed as a private-email-wielding public servant – and he was hacked too. Why do people do this?

Reasons why people do their business in the IT shadows

I’ll nominate 3 main reasons why people tend to use private and unauthorized tools and services in companies and public service. Then let’s look at what we can do about it, because this is a serious expansion of the organization’s attack surface! And we don’t want that, do we?

I believe (based on experience) the 3 main reasons are:

  1. The tools they are provided with are hard to use, impractical or not available
  2. They do not understand the security implications and have not internalized what secure behaviors really are
  3. The always-on culture blurs the distinction between “work” and “personal”; people don’t see that risks they are willing to take in their personal lives also affect their organizations, which typically have a completely different risk context

How to avoid the shadow IT rabbit hole of vulnerabilities

First of all, don’t treat your employees and co-workers as idiots. IT security is very often about locking everything down and hardening machines and services. If you go too far in this direction you make it very hard for people to do their jobs, and you can end up driving them into the far riskier practice of inventing their own workarounds using unauthorized solutions – like private email accounts. Make sure controls are balanced, and don’t forget that security is there to protect productivity – it is not the key product of most organizations. Therefore, your risk governance must ensure that you:

  • Select risk-based controls – don’t lock everything down by default
  • Provide your employees with the solutions they need to do their jobs
  • Remember that no matter how much you harden your servers, the human factor still remains.

Second, make people your most important security assets. Build a security aware culture. This has to be done by training, by leadership and by grassroots engagement in your organization.

Third, and for now last, disconnect. Allow people to disconnect. Encourage it. Introduce a separation between what is private and what belongs to work or your organization. This is important because the threat contexts of the private sphere and the organizational sphere are in most cases very different. This is also the most difficult part of the management equation: allowing flexible work while ensuring there is a divide between “work” and “life”. This is what work-life balance means for security; it allows people to maintain different contexts for different parts of their lives.

 

Users: from threats to security enhancers?

Security should be an organization-wide effort. That means getting everyone to play the same game, which requires IT to stop thinking about “users as internal threats” and start thinking about “internal customers as security enhancers”. This can only be achieved by using balanced security measures, involving the internal customers (users) through sharing the risk picture, and putting risk-based thinking behind security planning to drive rational and balanced decisions. For many organizations with a pure compliance focus this can be a challenging journey – but the reward at the end is an organization better equipped to tackle a dynamic threat landscape.

Users have traditionally been seen as a major threat in information security settings. The insider threat is very real, but this does not mean that the user is the threat as such. There has recently been much discussion about how we can achieve a higher degree of cybersecurity maturity in organizations, and whether cybersecurity awareness training really works. This post does not give you the answers, but it describes some downsides of the compliance-oriented tradition. The challenge is to find a good balance between controls and compliance on one side, and driving a positive security culture on the other.

Locking down users’ tools too much may enhance security on paper – until you consider the effect of trust erosion on the human attack surface. When knowledge workers feel they are not trusted by their organization, they also feel undervalued, something that can create hostility towards management in general, and cybersecurity policies specifically. There is no fun working in an environment where all the toys are locked down.

Most accidents involve “human error” somewhere in the accident chain, much like most security breaches also involve some form of human error – typically a user failing to spot a social engineering attempt where the security technology is equally inept at making good protection decisions. Email is still the most common malware delivery method, and phishing would not work without humans on the other end. This picture is what your security department is used to seeing: the user performs some action that allows the attacker to penetrate the organization. Hence, the user is a threat. The cure for this is supposed to be cybersecurity awareness training teaching users not to open attachments from sketchy sources, not to click those links, not to use weak passwords and so on. The problem is that this only partially works. Some people have even gone so far as to say it is completely useless.

The other part of the story is the user who reports that his or her computer is misbehaving, that some resources have become unavailable, or who forwards spear-phishing attempts. Those users are complying with policy and allowing the organization to spot potential recon or attack attempts before the fact, or at least relatively soon after a breach. These users are security enhancers – which is what security awareness training is trying to achieve by at least making users a little less dangerous.

Because people do risky things when possible, the typical IT department answer to the insider threat is to lock down every workstation as much as possible – to “harden” it, i.e. make the attack surface smaller. This attack surface view, however, only considers the technology, not the social component. If you lock down the systems more than the users feel is necessary, they will probably start opposing company policies. They will not report suspicious activities as often anymore. They will go through the motions of your awareness training, but little behavioral change will be seen afterwards. You risk that shadow IT starts to take hold of your business – that employees use their private cloud accounts, portable apps or private computers to do their jobs – because the tools they feel they need are locked down, made inflexible or simply unavailable by the IT department in order to “reduce the attack surface”. So not only do you risk priming your employees for social engineering attacks (angry employees are easier to manipulate) and making your staff less able to benefit from your training courses, you may also be significantly increasing the technical attack surface through shadow IT.

So what is the solution – to allow users to do whatever they want on the network, give them admin rights and no controls? Obviously a bad idea. The keywords are balanced measures, involvement and risk-based thinking.

  • Balanced: there must be a balance between security and productivity. A full lockdown may be required for information objects of high value to the firm and with credible attack scenarios, but not every piece of data and every operation is in that category.
  • Involvement: people need to understand why security measures are in place in order to make sense of them. Most security measures feel impractical to people who just want to get the job done. Understanding the implications of a breach and the cost-benefit ratio of the measures in place greatly helps people motivate themselves to do what feels slightly impractical.
  • Risk based thinking: measures must be adequate to the risk posed to the organization and not exaggerated. The risk picture must be shared with the employees as part of the security communication – this is a core leadership responsibility and the foundation of security aware cultures.

In the end it comes down to respect. Respect other people for what they do, and what value they bring to the organization. Think of them as customers instead of users. Only drug dealers and IT departments refer to their customers as users (quoted from somewhere forgotten on the internet).

Thinking about risk through methods

Risk management is a topic with a large number of methods. Within the process industries, semi-quantitative methods are popular, in particular for determining the required SIL for safety instrumented functions (automatic shutdowns, etc.). Two common approaches are LOPA, short for “layers of protection analysis”, and Riskgraph. These methods are sometimes treated as “holy” by practitioners, but the truth is that they are merely cognitive aids for sorting through our thinking about risks.

 

Riskgraph #sliderule – methods are formalisms. See picture on Instagram

 

In short, the risk assessment process consists of a series of steps:

  • Identify risk scenarios
  • Identify the risk-reducing measures already in place, such as design features and procedures
  • Determine the potential consequences of the scenario at hand, e.g. worker fatalities or a major environmental disaster
  • Estimate how likely or credible you think it is that the risk scenario will occur
  • Consider how much you trust the existing barriers to do the job
  • Determine how trustworthy your new barrier must be for the situation to be acceptable

Several of these steps can be difficult tasks on their own, and putting together a risk picture that allows you to make sane decisions is hard work. That’s why we lean on methods: they help us make sense of the mess that discussions about risk typically lead to.
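To make these steps concrete, here is a minimal LOPA-style sketch in Python. Every number in it is a hypothetical assumption chosen for illustration – a real analysis would take initiating frequencies, barrier credits and the tolerable frequency from validated data and the method’s own guidance.

```python
# Minimal LOPA-style sketch: mitigated frequency of one risk scenario.
# All numbers are hypothetical assumptions for illustration only.

initiating_frequency = 0.1   # assumed initiating events per year (statistics/judgment)

# Probability of failure on demand (PFD) credited to each existing, independent barrier.
existing_barriers = {
    "high level alarm + operator response": 0.1,
    "mechanical relief device": 0.01,
}

tolerable_frequency = 1e-5   # assumed acceptable events per year for this consequence

mitigated_frequency = initiating_frequency
for name, pfd in existing_barriers.items():
    mitigated_frequency *= pfd

print(f"Mitigated frequency: {mitigated_frequency:.2e} per year")

if mitigated_frequency > tolerable_frequency:
    # The remaining gap must be closed by a new safety function (e.g. a SIF with a SIL rating).
    required_pfd = tolerable_frequency / mitigated_frequency
    print(f"Required PFD of new barrier: {required_pfd:.2e}")
else:
    print("Existing barriers already meet the target.")
```

The arithmetic is trivial; the value of the method is that it forces you to state the scenario, the barriers and the target explicitly.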

Consequences can be hard to gauge, and one bad situation may lead to a set of different outcomes. Think about the risk of “falling asleep while driving a car”. Both of these are valid consequences that may occur:

  • You drive off the road and crash in the ditch – moderate to serious injuries
  • You steer the car into the wrong lane and crash head-on with a truck – instant death

Should you think about both, pick one of them, or another consequence not on this list? In many “barrier design” cases the designer chooses to design for the worst-case credible consequence. It can be difficult to judge what is really credible, and what is truly the worst case. And is this approach sound if the worst case is credible but still quite unlikely, while at the same time you have relatively likely scenarios with less serious outcomes? If you use a method like LOPA or Riskgraph, you may very well have a statement in your method description to always use the worst-case consequence. A bit of judgment and common sense is still a good idea.
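As a toy illustration of the dilemma, the sketch below compares the worst-case outcome with a frequency-weighted view of the same initiating event. The conditional probabilities and severity weights are made-up assumptions, not accident statistics.

```python
# Toy comparison: worst-case design basis vs. frequency-weighted risk.
# All numbers are made-up assumptions for illustration.

fall_asleep_freq = 0.02   # assumed initiating events per year for one driver

# outcome: (conditional probability given the event, severity weight on a 0..1 scale)
outcomes = {
    "run off the road - moderate to serious injuries": (0.9, 0.1),
    "head-on collision with truck - fatal":            (0.1, 1.0),
}

for name, (p_given_event, severity) in outcomes.items():
    frequency = fall_asleep_freq * p_given_event
    print(f"{name}: {frequency:.4f}/year, risk contribution {frequency * severity:.4f}")

# Designing only for the worst case sizes the barrier for the rare fatal outcome;
# a risk-based view also asks what the more frequent, less severe outcome contributes.
```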

Another difficult topic is probability, or credibility. How likely is it that an initiating event will occur, and what is the initiating event in the first place? If you are the driver of the car, is “falling asleep behind the wheel” the initiating event? Let’s say it is. You can definitely find statistics on how often people fall asleep behind the wheel. The key question is: are they applicable to the situation at hand? Are data from other countries applicable? Maybe not, if they have different road standards, different requirements for getting a driver’s license, and so on. Personal or local factors can also influence the probability. In the case of the driver falling asleep, the probability would be influenced by his or her health, stress levels, maintenance of the car, etc. The bottom line is that the estimate of probability will also be a judgment call in most cases. If you are lucky enough to have statistical data to lean on, make sure you validate that the data are representative of your situation. Good method descriptions should also give guidance on how to make these judgment calls.

Most risks you identify already have some risk-reducing barrier elements in place. These can be things like alarms and operating procedures, and other means of reducing the likelihood or consequence of escalation of the scenario. Determining how much you are willing to rely on these other barriers is key to setting a requirement for the safety function of interest – typically a SIL rating (see the sketch after the questions below). Standards limit how much you can trust certain types of safeguards, but also here there will be some judgment involved. Key questions are:

  • Are multiple safeguards really independent, such that the same type of failure cannot knock out multiple defenses at once?
  • How much trust can you put in each safeguard?
  • Are there situations where the safeguards are less trustworthy, e.g. if there are only summer interns available to handle a serious situation that requires experience and leadership?

Risk assessment methods are helpful, but don’t forget that you make a lot of assumptions when you use them. Keep questioning your assumptions even when you use a recognized method – especially if somebody’s life will depend on your decision.

SIL and ballast systems

Working on floating oil and gas facilities, one question keeps popping up about ballast systems: should they have SIL requirements, and if so, what should the requirements be? When seeking to establish requirements for such systems, several issues are uncovered. First of all, current ballast system designs are very robust, thanks to a long evolution of designs and requirements in shipping. Further, the problem is much more complex than the typical process industry problem description, where a few well-defined failure modes with random failure data lead to a given situation. This complexity depends on a number of factors, some of them specific to each ship or installation, such as location, ship traffic density or the operating practices of personnel on board. Therefore, any quantitative estimates of “error probabilities” contributing to an expected return frequency of critical events for the system will have significant uncertainties associated with them.


A ballast system is used to maintain the stability of a ship or a floating hull structure under varying cargo loading conditions, sea conditions and drafts. Water is kept in tanks dispersed around the hull structure, and can be pumped in or out, or transferred between tanks, to maintain stability. Errors in ballasting operations can lead to loss of stability, which in the worst case means a sunken ship. Ballasting is normally a semi-manual operation where a marine operator uses a loading computer to guide decisions about ballasting, and manually gives commands to a computer-based control system about transferring water into or out of a particular ballast tank. Because this is such a critical safety system, it is natural to ask: “what are the performance requirements?”

Ballast systems have been part of shipping for hundreds of years. Requirements for ballast systems are therefore set in the rules of ship classification societies, such as Lloyd’s Register, DNV GL or ABS. These requirements are typically prescriptive in nature and focus on robustness and the avoidance of common cause failures in the technology. Maritime classification societies do not refer to safety integrity levels but rely on other means of ensuring safe operation and reliability. Society has accepted this practice for years, for very diverse vessels ranging from oil tankers to passenger cruise ships.

In oil and gas operations, the use of safety integrity levels to establish performance requirements for instrumented safety functions is the norm, and standards such as IEC 61508 are used as the point of reference. The Norwegian Oil and Gas Association has issued a guideline that is normally applied for installations in Norwegian waters, which offers a simplification of requirement setting based on “typical performance”. The guideline can be freely downloaded from the association’s web page. It states that for “start of ballasting for rig re-establishment”, the system should conform to a SIL 1 requirement. The “system” is described as consisting of a ballast control node, 2 x 100% pumps and three ballast valves. Appendix A.12 of the guideline describes this “sub-function” together with a calculation of achievable performance.

It may be argued that this functional description is somewhat artificial, because the ballast system on a production installation is normally operated more or less continuously. The function is defined for a single ballast tank/compartment, irrespective of the number of tanks and the load balancing necessary to re-establish stability. The Guideline 070 approach is based on “typical performance” of the safety system as defined, and is not linked directly to the risk reduction required from the system.

Multiple approaches may be taken to assigning safety integrity levels based on risk analysis; see for example IEC 61508. One method that is particularly common in the process industries and the oil and gas industry is “layers of protection analysis”, or LOPA for short. In this type of study, multiple initiating events can contribute to one hazardous situation, for example “sunken ship due to loss of stability”. Multiple barriers or “independent protection layers” can be credited for reducing the risk of this hazard being realized. In order to use a risk-based method for setting the integrity requirement, it is necessary to define an acceptable frequency for this event. Let us say, for the sake of the discussion, that it is acceptable for the mean time between each “sunken ship due to loss of stability” event to be 1 million years. How can we reason about this to establish requirements for the ballast system? The functional requirement is that we should “be able to shift ballast loading to re-establish stability before the condition becomes unrecoverable”. To start analyzing this situation, we need to estimate how often we will have a condition that can lead to such an unrecoverable situation if not correctly managed. Let us consider three such initiating events:

  • Loading operator error during routine ballasting (human error)
  • Damage to hull due to external impact
  • Error in load computer calculations

All of these situations depend on a number of factors. The probability that the loading operator will make an error depends on stress levels, competence/training and management factors. A thorough analysis using human reliability analysis can be performed, or a more simplified approach may be taken. We may, for example, make the assumption that the average operator makes one error without noticing it immediately every 100 years (this is an assumption – it must be validated if used).

Damage to the hull due to external impact depends on the ship traffic density in the area, on whether there is a difficult political situation (war, etc.), and on whether you are operating in arctic environments where ice impact is likely (think Titanic). Again, you may do extensive analysis to establish such data, or make some assumptions based on expert judgment. For example, we may assume a penetrating ship collision every 100 years on average.

What about errors in the load computer calculations? Do the operators trust the load computer blindly, or do they perform sanity checks? How was the load computer programmed? Is the software mature? Is the loading condition unusual? Many questions may be asked here as well. For the sake of this example, let us assume there is no contribution from the loading computer.

We are then looking at an average initiating event frequency of 0.01 per year for human errors and 0.01 per year for hull damage.

Then we should think about our options for avoiding the accident scenario, given that one of the initiating events has already occurred. As “rig re-establishment” depends on the operator performing some action on the ballast system, key to such barriers is making the operator aware of the situation. One natural way to do this is to install an alarm indicating a dangerous ballast condition, and to train the operator to respond. What is the reliability of this as a protection layer? The ballast function itself is what we are trying to set the integrity requirement for, and any response by the operator requires this system to work. Simply notifying the operator is thus necessary but not sufficient. If the ballast system fails when the operator tries to rectify the situation, the big question is: does the operator have a second option? Such options may include a redundant ballast system that does not share components with the primary one, to avoid common cause failures. In most situations the dynamics will be slow enough to permit manual operation of pumps and valves from local control panels – a redundant option, provided the operator is trained for it. If the alarm does not use the same components as the function itself, we have an independent protection layer. The reliability of this alarm, combined with the required response of a well-trained operator, cannot be credited with better than a 90% success rate in a critical situation (ref. IEC 61511, for example).

So, based on this super-simplified analysis, are we achieving our required MTTF of 1 million years?

Initiating events per year: 0.02.

Failure probability of the IPL (alarm + operator response using local control panels): 0.1.

So we are achieving an MTTF of:

1/(0.02 x 0.1) = 500 years.
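The same arithmetic in a small Python sketch, using only the assumptions stated above, also shows how large the gap to the stated target is:

```python
# Reproducing the simplified ballast LOPA above. All figures are the assumptions
# stated in the text, not validated data.

human_error_freq = 0.01   # operator loading error, assumed once per 100 years
hull_damage_freq = 0.01   # penetrating collision, assumed once per 100 years
initiating_freq = human_error_freq + hull_damage_freq        # 0.02 per year

ipl_pfd = 0.1             # alarm + operator response via local panels (max credit)

mitigated_freq = initiating_freq * ipl_pfd                    # 0.002 per year
mttf_years = 1 / mitigated_freq
print(f"Achieved MTTF: {mttf_years:.0f} years")               # 500 years

target_mttf_years = 1_000_000                                 # target from the text
print(f"Shortfall factor: {target_mttf_years / mttf_years:.0f}")  # 2000
```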

This is pretty far from where we said we should be. First of all, it would require our ballast system to operate with better than SIL 4 performance (which is completely unrealistic), and furthermore, it relies on the same operator again performing manual actions. Of course, considering how many ships are floating at sea and how few of them are sinking, this is probably a quite unrealistic picture of the real risk. Using super-simple tools to address complex accident scenarios is probably not the best solution. For example, the hull penetration scenario itself has lots of complexity – penetrating a single compartment will not threaten global stability. Furthermore, the personnel will have time to analyze and act on the situation before it develops into an unrecoverable loss of stability – but the reliability of them doing so depends a lot on their training, competence and the installation’s leadership.

There are three take-away points from this short discussion:

  • Performance of ballast systems on ships is very good due to long history and robust designs
  • Setting performance requirements based on risk analysis requires a more in-depth view of the contributing factors (initiators and barriers)
  • Uncertainty in quantitative measures is very high, partly due to complexity and installation-specific factors; aiming for “generally accepted” technical standards is a good starting point

Get results by running fewer but better meetings

I’ve been to a lot of meetings – it is the battleground of modern business. It is also where we make decisions, drive progress and get our priorities aligned. Most meetings, however, are just terrible energy drains. Bad meetings are bad for people, and they harm quality. It is not hard to claim that bad meetings are also bad for safety, if the workshops and meetings used to drive risk assessments and engineering activities are not well organized with a clear focus. Based on experience from a decade in meeting rooms, I’ve devised the following five rules for great meetings that I think are truly helpful.

Meetings can vary a lot in format and location – selecting the architecture of your meeting carefully is one of the rules for driving great meeting experiences.

Meeting Rule #1 If your meeting does not have a clear purpose, a specific agenda, and defined desired outcomes, the meeting shall not take place.

Meeting Rule #2 Carefully select attendees and share the purpose and agenda of your meeting with the attendees in advance, asking for feedback. Continue to foster debate and two-way interactions in the meeting.

Meeting Rule #3 Adapt architecture of meetings to the purpose, agenda and size of the meeting, by carefully selecting visual aids, meeting locations, duration and formality to fit your needs.

Meeting Rule #4 Stay close to the agenda to show that you value results, and at the same time give praise where praise is due both during the meeting and in the minutes. Make sure you make it very clear when you and your team have reached a desired outcome in your meeting.

Meeting Rule #5 Never invite to a meeting to drive outcomes you do not feel OK with from an ethical standpoint.

I’m very interested to hear what you think about these rules, and whether you have other heuristics for making meetings work. I’m sure not all meetings I lead are great, but they are probably much better after I realized the few things summarized in these rules than they used to be. Tell me what you think in the comment field, or on Twitter (@sjefersuper).

Would you operate a hydraulic machine with no cover or emergency stop?

Sounds like a crazy idea, right? Research, however, has shown that about half of professional machine operators do not think safety functions are necessary. You know, things like panel limit switches, torque limiters and emergency stop buttons. Who would need those, right? This number comes from a report issued in Germany about 10 years ago, but I am not very optimistic that these numbers have improved since then. The report can be found here: http://www.dguv.de/ifa/Publikationen/Reports-Download/BGIA-Reports-2005-bis-2006/Report-Manipulation-von-Schutzeinrichtungen/index.jsp (in German).

Machine safety is important for both users and bystanders. Manipulations of safety functions are common – and the risk increase is typically unknown to users and others. How can we avoid putting our people at risk due to degraded safety of our machinery?

Researchers have found that the safety functions of machines are frequently manipulated. This is typically done because workers perceive the manipulation as necessary to perform the work, or to improve productivity. Everyone from machine builders to purchasers to operators should take this into account to avoid accidents. Consider, for example, a limit switch. A machine built to conform to the machinery directive (with CE marking) has to satisfy safety standards. Perhaps a SIL 2 requirement has been assigned to the limit switch because operation without it is deemed dangerous and a 100-fold risk reduction is necessary for operation to be acceptable. This means that if the limit switch is put out of function, the risk of operation is 100 times higher than the designer intended!
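As a back-of-the-envelope illustration of what that means, the sketch below compares the hazardous event frequency with and without the limit switch in service. The demand rate is an assumed number; the 100-fold factor follows from the SIL 2 risk reduction mentioned above.

```python
# Sketch: effect of bypassing a SIL 2 safeguard, in simple LOPA terms.
# The demand rate is an illustrative assumption.

demand_rate = 0.5          # assumed dangerous demands per year on the limit switch
pfd_sil2_switch = 1e-2     # SIL 2 corresponds to at least a 100-fold risk reduction

risk_with_safeguard = demand_rate * pfd_sil2_switch   # hazardous events per year
risk_when_bypassed = demand_rate * 1.0                # manipulated safeguard: PFD -> 1

print(f"With limit switch: {risk_with_safeguard:.3f} events/year")
print(f"Switch bypassed:   {risk_when_bypassed:.3f} events/year")
print(f"Risk increase:     {risk_when_bypassed / risk_with_safeguard:.0f}x")
```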

What can we do about this? We need to design machines such that safety functions become part of the workflow – not an obstacle to it. If workers perceive nothing to gain from manipulating the machine, they are not likely to do it. This boils down to something we are aware of, but not good enough at taking into account in design processes: usability testing is essential not only to make sure operators are happy with the ergonomics – it is also essential for the safety of the people using the machine!

Are you aware of the effect of work life balance issues on the quality of your team’s work?

Make sure your people do not feel like a hot kettle with nowhere to let the steam out – that can lead to broken designs – and if your line of work is designing safety critical systems, broken designs usually mean a greater chance of loss of life, environmental damage and large financial losses.

Having a way to control the internal steam pressure of your team members may be utopia – but you should still look for ways to avoid disasters, together with your people.

We all know that the quality of our work varies – with a large number of factors. If we are overworked or really worried about something in our personal lives, the quality of our work will most likely suffer. If you are responsible for functional safety in a large project, human error can be disastrous, not only for the project but for the people working in the plant once it is operational. Whether it is yourself or an entire team you are responsible for, you need to be aware of the key performance shaping factors. These factors are described in detail in human reliability analysis methods, such as those developed by Idaho National Laboratory for the nuclear industry: http://www.nrc.gov/reading-rm/doc-collections/nuregs/contract/cr6883/cr6883.pdf. These techniques can lend terminology and thinking that is useful in the project itself, to help manage the risk of significant human errors in the project phase. Remember – misunderstanding the risk factors and barrier elements themselves may lead to insufficient barriers against major accident hazards in the real plant! The performance shaping factors in the SPAR-H methodology described in the linked document are:

  • Available time
  • Stress/stressors
  • Complexity
  • Experience/training
  • Procedures
  • Ergonomics/HMI
  • Fitness for duty
  • Work processes

These factors were defined for typical process operator actions in a nuclear power plant, but they are also relevant for other types of tasks. Functional safety work typically has a high degree of complexity. The experience and training of people involved in the safety lifecycle tend to vary a lot, and procedures and work processes are not always clear to everyone involved. All of this falls under “management of functional safety”, and project managers should think about what creates great quality when planning and managing the project. In many projects time is quite limited, and the term “schedule impact” is a rather frightening concept to many project managers. This can lead to tasks being perceived as less important simply because the schedule is prioritized over quality. For safety critical tasks, this should not be allowed to happen.
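For readers unfamiliar with SPAR-H, the sketch below shows the basic mechanics: a nominal human error probability is multiplied by a multiplier for each performance shaping factor. The nominal value for action-type tasks comes from SPAR-H; the multiplier values chosen here are illustrative assumptions – use the published SPAR-H tables and guidance for any real analysis.

```python
# SPAR-H-style sketch: a nominal human error probability (HEP) adjusted by
# performance shaping factor (PSF) multipliers. The multiplier values below
# are illustrative assumptions, not a substitute for the SPAR-H tables.

nominal_hep_action = 1e-3   # SPAR-H nominal HEP for an action-type task

psf_multipliers = {
    "available time (barely adequate)": 10,
    "stress (high)": 2,
    "complexity (moderately complex)": 2,
    "experience/training (nominal)": 1,
    "procedures (nominal)": 1,
    "ergonomics/HMI (nominal)": 1,
    "fitness for duty (nominal)": 1,
    "work processes (nominal)": 1,
}

adjusted_hep = nominal_hep_action
for factor, multiplier in psf_multipliers.items():
    adjusted_hep *= multiplier

adjusted_hep = min(adjusted_hep, 1.0)        # a probability cannot exceed 1
print(f"Adjusted HEP: {adjusted_hep:.3f}")   # 0.040 with the assumptions above
```

Even a rough exercise like this makes the discussion about schedule pressure concrete: with these assumptions, the factor of 10 on available time dominates the result.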

Some factors from your project members’ personal lives may have a severe impact on performance. People on the project team can be stressed or not “fit for duty” due to a number of challenges that are not only work related. How can we deal with this? Project managers need to know their teams beyond their tasks and work backgrounds. You need to create an environment of trust, so that you have a greater chance of catching performance-limiting factors originating from outside the organization. For many people these factors are not something dramatic such as divorce, alcohol abuse or depression; it may simply be the challenge of making daily life work. People want balance in life – with room for work, family, friends, hobbies, and so on. Working on a high-stakes project may itself be a threat to a balanced life. By knowing your people you can help them find the balance that will also improve their performance at work. Flexible work hours, part-time telecommuting and close follow-up with real feedback to every member of your team can help.

We routinely consider human factors, the effect of the work environment and external performance shaping factors for operators. We should strive just as hard to let people perform at their best when their work is to design the very systems those operators will use after commissioning.