Archive for the 'Data Analysis' Category

Dating and InfoSec

So if you don’t follow the folks over at OKCupid, you are missing out on some hot data. In case you’re not aware of it, OKCupid is:

the best dating site on earth. Compiling our observations and statistics from the hundreds of millions of user interactions we’ve logged, we use this outlet to explore the data side of the online dating world.

And in their latest post, they explore what brand of camera makes you look good. You should go read “Don’t be ugly by accident.” I’ll wait.

You’re back? Ok. So here, let me lay this out for you. These folks are applying science, not to dating, but to online dating profiles. They’re not slinging some best practice shtick, or re-writing profiles at $50 a pop, they’re telling you exactly what photos work and which ones don’t. How are they doing this? Data. Experiment. Analysis.

I don’t want to understate the importance of finding a good partner, but I will say how sad it is that they have all this data on this highly intimate activity, and we have 2,000 entries in DatalossDB.

Alex on Science and Risk Management

Alex Hutton has an excellent post on his work blog:

Jim Tiller of British Telecom has published a blog post called “Risk Appetite, Counting Security Calories Won’t Help”. I’d like to discuss Jim’s blog post because I think it shows a difference in perspectives between our organizations. I’d also like to counter a few of the assertions he makes because I find these to be misunderstandings that are common in our industry.

“Anyone who knows me or has subjected themselves to my writings knows I have some uneasiness with today’s role of risk. It’s not the process, but more of how there is so much focus on risk as if it were a science – but it’s not. Not even close.”

Let me begin my rebuttal by first arguing that risk management, at its basis, is at least “scientific work”. What I mean by that is elegantly summed up by Eliezer Yudkowsky on the Less Wrong blog. To use Eliezer’s words, I’ll offer that scientific work is “the reporting of the likelihood ratios for any popular hypotheses.”

You should go read “Risk Appetite: Counting Risk Calories is All You Can Do”.

Getting the time dimension right

If you are developing or using security metrics, it’s inevitable that you’ll have to deal with the dimension of time.  It’s harder than it looks, and I’ve seen many people make mistakes with it that render their overall metrics faulty or worse.  The problems often start with our basic concepts and how we use words.

"Time flies like an arrow, but fruit flies like bananas" -- Groucho Marx

“Data” tells you about the past

“Data” is the output of some observation or measurement process.  If your data is about some states of the world, then by definition your data lives in the past.  You did your measurements or your experiments, generated your data, and then time passed as you assess it, report it, and act on it.  Thus, your data is reporting on history.  Only by acts of inference can you connect your data with the present state of the world or the future state.

In the physical sciences and engineering, they can safely assume that the system under study is the same over time — past, present, and future.  This is called the ergodic hypothesis.  In statistics, the underlying stochastic process is treated as stationary.   This makes it possible to extrapolate the past into the present and future using regression and other techniques.

There are people in the security metrics community who only want to operate on data.   They view anything that is not the result of empirical measurement as pure speculation or a dangerously-seductive “model”.    (See Models are Distracting, and Measurement over Models)    Being an engineer myself, I’m all in favor of empirical data, measurement, and experiments.  But I contend that we will never get to measures of “security” or “risk” through empirical data alone.   Our systems are non-stationary and non-ergodic.
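To make the stationarity point concrete, here is a minimal Python sketch (mine, not from any of the posts referenced above). It forecasts the future as the mean of the past, once for a process whose distribution stays put and once for a process whose mean jumps after the observation window. The distributions, the seed, and the size of the jump are all made-up assumptions for illustration.

```python
import random
import statistics


def forecast_error(history, future):
    """Forecast the future as the mean of the past; return mean absolute error."""
    forecast = statistics.fmean(history)
    return statistics.fmean([abs(x - forecast) for x in future])


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Observation window: what our "data" recorded.
past = [rng.gauss(100, 10) for _ in range(500)]

# Stationary case: the future comes from the same distribution as the past.
stationary_future = [rng.gauss(100, 10) for _ in range(500)]

# Non-stationary case (hypothetical): the mean jumps after we stop measuring.
shifted_future = [rng.gauss(160, 10) for _ in range(500)]

stationary_error = forecast_error(past, stationary_future)
shifted_error = forecast_error(past, shifted_future)

print(f"error when the process is stationary:  {stationary_error:.1f}")
print(f"error when the process shifts:         {shifted_error:.1f}")
```

The same historical data supports a decent forecast in the first case and a badly wrong one in the second, and nothing in the data itself tells you which case you are in.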

“Security” is a judgement about the present

If we start with the simple high-level question: “Am I secure?”, it becomes clear that any measurement of security must relate to the present time (or possibly a retrospective view on a previous time, i.e. past perfect tense, or prospective view on a future time, i.e. “will I be secure?”).  I call it a “judgement” because security depends on the threats you are facing.  (I play a historically-realistic computer game with my son, called Total War, that includes features that allow you to invest in offensive and defensive capabilities.  How much to invest and how fast to invest depends on who you are facing.  A wooden palisade will be an adequate defense against peasants and spear militia, but hopelessly inadequate against onagers and trebuchets, backed by armored cavalry!)

Thus, you can measure anything and everything you want about security, generating tons of data, and in the end you will have to make a judgement:  “Am I secure?” — or are my security provisions adequate given the threats we face?   Seen this way, your data is really just evidence that is used in this judgement (and inference) process.   What I mean by this is that I don’t think you can simply calculate your way from ground-truth data to any overall security metrics.  There will always be a judgement or inference step(s).

Why?  Because we must account for events, circumstances, and scenarios that haven’t happened yet, or happen so rarely that we have no relevant data, or are beyond the reach of measurements.  (After all, the miscreants often do their best to hide their actions.)   On top of this, the security landscape changes rapidly and occasionally dramatically.  Our judgement about security must factor in these changes, to the best of our knowledge.   Finally, our judgement about “are we secure?” is predicated on our risk tolerance.  But what is “risk”?

“Risk” is a cost of the future, brought to the present

This is the economist’s definition of risk, where “cost” here means downside cash flows that are beyond some  threshold of expectation or variability.  Those costs become “risk” when you can account for them in present dollars using some discounting and insurance method.  (This says nothing about the “insurability” of the risk, only about the theoretical possibility of accounting for risk in present dollars by some reasonable method.  The “insurance method” might be diversification, hedging, self-insurance, risk pooling, contingent contracts, or traditional insurance.)
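As a rough illustration of “a cost of the future, brought to the present”, here is a small Python sketch. It takes the expected downside beyond a threshold for each future year and discounts it to present dollars. The function name, the loss scenarios, the threshold, and the discount rate are all hypothetical, and plain discounting is only one of many ways you might do the accounting.

```python
def present_value_of_risk(yearly_scenarios, threshold, discount_rate):
    """Discount the expected loss above `threshold` for each future year.

    yearly_scenarios: one list per year of (probability, loss) pairs.
    """
    total = 0.0
    for year, scenarios in enumerate(yearly_scenarios, start=1):
        # Only the downside beyond the threshold counts as "risk" here.
        expected_excess = sum(p * max(loss - threshold, 0.0) for p, loss in scenarios)
        total += expected_excess / (1 + discount_rate) ** year
    return total


# Hypothetical numbers: each year, a 90% chance of a $50k loss (within
# expectations) and a 10% chance of a $500k loss (well beyond them).
one_year = [(0.9, 50_000), (0.1, 500_000)]
pv = present_value_of_risk([one_year] * 3, threshold=100_000, discount_rate=0.05)

print(f"risk in present dollars: ${pv:,.0f}")
```

Note that the routine $50k losses contribute nothing: they sit below the threshold, so they are ordinary cost, not risk in this sense.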

This parallels Peter Drucker’s characterization of profit: “Profit is … needed to pay for attainment of the objectives of the business. Profit is a condition of survival. It is the cost of the future.  The cost of staying in business.” [emphasis added]   Ontologically, “profit” and “risk” are in the same category, which is why it makes sense to measure “risk-adjusted return” and the like.

From the viewpoint of risk, what you have spent in the past is irrelevant  (“sunk costs”).  All rational decisions are based on future cash flows and options.  The only value of the past is if it helps you predict or forecast the future.  Thus, you can’t reach a final judgement about security in the present if you don’t also have some useful estimate of risk in the future.   If the answer to “Am I secure?” is “Yes”, then the implication is that you can live with the risk associated with this level of security.   By “useful”, I mean sufficiently discriminating to inform the judgement — “bigger than a breadbox, smaller than a house”.

This is where information security deviates from reliability engineering.   In the latter, the ergodic hypothesis holds and the dynamics are sufficiently “tame” to permit statistical data analysis for inference and forecasting.  Even when there are “humans in the loop”, their behavioral tendencies can often be characterized by stable probability distributions.  In information security, we are dealing with adaptive, intelligent, strategic players: not only miscreants, but also “ancillary players” like end-users, auditors, supply chain partners, and so on.  This makes risk estimation a “wicked problem”.  But is it hopeless?

Estimating risk may be hard, but not impossible

Plenty of smart security people contend that quantitative risk estimation is impossible or infeasible in principle.  Proving or disproving this assertion would take heavy-duty theoretical analysis (and I may do it some day).  But for now consider two extreme situations.

Think of security and risk as a black-box process that generates a continuous stream of cash flows in time (i.e. total spending on security and losses in that time period).  At one extreme, the output is a stationary function or stochastic process.  This is the realm that Nassim Nicholas Taleb called “Mediocristan”, since the data stream is well-behaved enough that nothing very surprising happens.  With enough historical data and enough data analysis, I think we’d all agree that risk estimation is feasible with current methods.

At the other extreme, the output is generated by a strategic agent (inside the box) whose sole purpose is to screw up our risk estimation process.  Let’s call this Descartes’ Demon, after René Descartes, who introduced a skeptical scenario called the deceiving demon argument to challenge our beliefs that an external world exists; in particular, it raises the possibility that some sort of malicious, demonic non-God has “employed all his energies in order to deceive me”.    If Descartes’ Demon can maintain a history of the output and also has information about our risk estimation process, he can mimic any output pattern and change those patterns arbitrarily to defeat any estimation process we might apply.   (This is more extreme than Taleb’s “Extremistan” in terms of defying estimation or prediction.)   In this case, I believe it could be proved that estimation is impossible (or undecidable or infeasible from a computational point of view).
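Here is a toy version of the demon in Python, offered as a sketch and certainly not a proof. The estimator predicts the next output as the mean of the history. The demon knows this and always emits whichever allowed value is farthest from the prediction. The estimator, the demon’s strategy, and the 0-to-1 output range are all my own illustrative assumptions.

```python
def mean_estimator(history):
    """Predict the next output as the mean of everything seen so far."""
    return sum(history) / len(history) if history else 0.5


def demon_output(prediction, low=0.0, high=1.0):
    """Emit whichever bound lies farther from the estimator's prediction."""
    return high if (prediction - low) < (high - prediction) else low


history = []
errors = []
for _ in range(20):
    prediction = mean_estimator(history)
    actual = demon_output(prediction)  # the demon sees our prediction first
    errors.append(abs(actual - prediction))
    history.append(actual)

print(f"smallest error over 20 rounds: {min(errors)}")
```

No matter how much history this estimator accumulates, its error never drops below half the output range. More data does not help when the thing generating the data is playing against you.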

Some people might argue that information security is exactly in this latter extreme situation, but I don’t think so.  The reason is that all the players have much stronger motives and forcing functions than to subvert the risk estimation processes.  Bad guys want to make money or cause harm.  End users want to avoid hassles and minimize effort and get their job done.  Managers want to manage their business while avoiding negative repercussions.  All of these factors add some elements of predictability and understandability.

But it may only be possible to factor all of these in through the use of models and simulations that represent our best knowledge, our best estimates, and our best beliefs about how they all relate to each other and the overall results.

The marriage of data, security, and risk = social learning processes

Putting this all together, we need to gather a lot of empirical data to understand relationships, patterns, and dependencies.  But to measure security we need to add inference and judgement processes that extend our data into the present, given the threat landscape we believe we are facing.  But to make a judgement about security and make decisions about alternative security postures, we need a useful estimate of risk to decide how much security is enough.  To tie these all together over time requires effective social learning processes, including model validation through experiments and data analysis.  Likewise, risk estimation and security judgement processes tell us what data we need to collect and how to analyze it.

Whether you agree with this framework or not, you should make explicit and consistent definitions of the time dimension relative to your metrics.

Source, Data or Methodology: Pick at least one

In the “things you don’t want said of your work” department, Ars Technica finds these gems in a GAO report:

This estimate was contained in a 2002 FBI press release, but FBI officials told us that it has no record of source data or methodology for generating the estimate and that it cannot be corroborated…when we contacted FTC officials to substantiate the estimate, they were unable to locate any record or source of this estimate within its reports or archives, and officials could not recall the agency ever developing or using this estimate. (“US government finally admits most piracy estimates are bogus,” Ars Technica)

Of course, no one in information security would ever do such a thing.

Friday Visualization: Wal-mart edition

I’ve seen some cool Walmart visualizations before, and this one at FlowingData is no exception.

The one thing I wondered about as I watched was if it captured store closings–despite the seemingly inevitable march in the visualization, there have been more than a few.

On Uncertain Security

One of the reasons I like climate studies is because the world of the climate scientist is not dissimilar to ours.  Their data is fraught with uncertainty, it has gaps, and it might be kind of important (regardless of your stance on anthropogenic global warming, I think we can all agree that when the climate changes, crazy things can happen).

Recently, the mainstream press has begun to pick up on this, and is trying to explain what science is doing.  One such example is this Times (UK) story called

Scientists Need The Guts To Say, “I Don’t Know”

In it, the author (David Spiegelhalter – Professor of the Public Understanding of Risk at the University of Cambridge) discusses uncertainty in past (and forward) looking predictions.  Yes, it’s worth noting that the science of prediction applies to all three states of time: past, present, and future.

As a security professional, I always encourage the representation of uncertainty.  Depending on the audience, I’ll represent uncertainty technically, or at a high level with words like “back of the napkin, very rough, a lot of unknowns, fairly certain, pretty good idea…”  I’ve found that as long as they are properly qualified, demonstrations  of risk with high degrees of uncertainty are not unuseful.

HEY, YOU GOT YOUR VISIBILITY INTO MY UNCERTAINTY!!! AND YOU GOT YOUR UNCERTAINTY IN MY VISIBILITY!!!

They really *are* two great tastes that taste great together….

One of the great reasons for the IT Risk management/Security team to communicate uncertainty (esp. to others with money) is that if you say “here’s what we think but we’re not sure”, you can then tell the business owner “and if you give me $funding we can decrease that uncertainty by gaining visibility into $whatever”.  If they decline, they’re accepting both the risk and the probability that you’re wrong.  But if they’re uncomfortable with the uncertainty, now you have a pretty good qualitative way of knowing that their tolerance for this level of risk is pretty low, and you might even be able to skip right past the “buy more visibility” step above and move right into “of course, we can just spend $Y and take care of the whole thing, visibility, risk reduction and all….”

Similarly, if you, the security manager, keep getting risk analyses back that have significant uncertainty in them – you know that these are areas where you really don’t have much control.  They may represent reasons or opportunities to strengthen policies, processes, capabilities (w00t everybody goes to training in Cancun!) and so forth.

So while it’s the enemy of accuracy, uncertainty can also be your friend.

One last note, having to do with uncertainty: in the article the author uses the Taleb definition of “Black Swan”.  Again, calling a rare event a “Black Swan” is a misnomer.  Rarity in frequency is only one aspect of what the concept of Black Swan represents.  A much better definition of a Black Swan is “an occurrence which is not representable at all given our prior distributions”.  Certainly, even before Prof. Spiegelhalter corrected the model for double-yolked eggs, the occurrence of 6 is not a true Black Swan.  We could have run MCMC sims until our computers melted into hot lumps of toxic waste and various occurrences of double-yolked eggs would/could have been represented.
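A quick way to see the “representable given our priors” point is to write the naive prior down. The Python sketch below treats each egg in a box of six as independently double-yolked with some fixed rate and lists the probability of every possible count. The rate of 1 in 1,000 is an assumption chosen for illustration, as is the independence between eggs; neither is a claim about the article’s actual model.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


ASSUMED_RATE = 1 / 1000  # hypothetical per-egg chance of a double yolk
BOX_SIZE = 6

# The naive prior over "how many double-yolked eggs are in this box".
prior = {k: binomial_pmf(k, BOX_SIZE, ASSUMED_RATE) for k in range(BOX_SIZE + 1)}

for count, probability in prior.items():
    print(f"{count} double-yolked eggs: {probability:.3e}")
```

Six out of six comes out absurdly improbable under these assumptions, but it is on the list with a nonzero probability. Rare, yes. Unrepresentable, no.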

Data void: False Positives

There’s a good post at Gartner pointing out the lack of data reported by vendors or customers regarding the false positive rates for anti-spam solutions.  

Although Gartner customers almost never complain about false positive rates, I wonder if false positives are underestimated. End users rarely complain about false positives, but they are very vocal reporting Spam in their inbox. Box Sentry (www.boxsentry.com) recently did tests in a number of organizations and found the false positive rate in some organizations using popular anti-spam tools was as high as 13% of legitimate emails. The largest proportion of false positives in their study was legitimate person-to-person traffic.  While it could be that these organizations have over-tuned their systems to block more Spam at the expense of quarantining more legit email, the reality was the email administrators had no idea they had such a high false positive rate because they never checked.  Have you? 

Going further, it would be very valuable to estimate the cost of false positives.

As I’ve discussed in a previous post, this is just another instance of a general problem in the security industry.  You can’t do rational analysis of effectiveness, cost-effectiveness, risk, and the rest without some estimate of false positive rates and their costs.
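For what it’s worth, the arithmetic is not the hard part. Here is a back-of-the-napkin Python sketch: measure the false positive rate from an audit of quarantined mail, then multiply through by volume and a per-message cost. The audit counts are invented to match the 13% figure quoted above, and the daily volume and the $2 cost per lost message are pure assumptions. Getting real values for those inputs is exactly the data void in question.

```python
def false_positive_rate(false_positives, legitimate_total):
    """Share of legitimate messages that were wrongly blocked or quarantined."""
    if legitimate_total <= 0:
        raise ValueError("need at least one legitimate message in the sample")
    return false_positives / legitimate_total


def daily_false_positive_cost(legitimate_per_day, fp_rate, cost_per_false_positive):
    """Expected daily cost of legitimate mail that never reaches the inbox."""
    return legitimate_per_day * fp_rate * cost_per_false_positive


# Hypothetical audit: 130 of 1,000 sampled legitimate messages were quarantined.
rate = false_positive_rate(false_positives=130, legitimate_total=1000)

# Hypothetical volume and cost: 10,000 legitimate messages a day, $2 per miss.
daily_cost = daily_false_positive_cost(10_000, rate, 2.00)

print(f"false positive rate: {rate:.0%}")
print(f"estimated daily cost: ${daily_cost:,.2f}")
```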

Symantec State of Security 2010 Report Out

http://www.symantec.com/content/en/us/about/presskits/SES_report_Feb2010.pdf

Thanks to big yellow for not making us register!  Oh, and Adam thanks you for not using pie charts…

The Visual Display of Quantitative Information

In Verizon’s post, “A Comparison of [Verizon's] DBIR with UK breach report,” we see:

pie-charts-suck.jpg

Quick: which is larger, the grey slice on top, or the grey slice on the bottom? And ought grey be used for “sophisticated” or “moderate”?


I’m confident that both organizations are focused on accurate reporting. I am optimistic that this small example of the utility of pie charts will inform report writers. The report writers and their graphics departments, loving their customers, will move to bar charts to help them compare numbers between sources.

I’m confident that not using pie charts is a best practice.

Elsewhere: “The only time it makes sense to use a pie chart.”

And elsewhere: “The Visual Display of Quantitative Information, 2nd edition”

Does It Matter If The APT Is “New”?

As best as I can describe the characteristics of the threat agents that would fit the label of APT, that threat community is very, very real.  It’s been around forever (someone mentioned first use of the term being 1993 or something).  We dealt with threat agents you would describe as “APT” at MicroSolved when I was there in 2001-2005.  We dealt with it as a firewall vendor at Progressive Systems in 1998.  This isn’t an “is the APT real?” blogpost.

That said, I wanted to talk about why there should be still more discussion around the APT.  Hogfly at the Forensic Incident Response blog asks:

“What should matter is how successful they have been. What should matter is defending ourselves. What should matter is how and where we share this information. What should matter is taking this information to those with the ability to do something about it. What should matter is taking the fight to the enemy.

So I ask again, does it matter if this threat is new?”

My response is that it actually matters very much.

We are hearing a new label.  Whether the label originated from “the cool kids” or not, it’s being co-opted by marketing.  And right now, we’re sort of in this important window of trying to get some understanding, some significant amount of intersubjectivity about what the APT is and what it means to a broader audience.  Once that’s established, then we can try to understand what to do.  But why does it matter if the threat is new or old?

There is a significant increase in the use of the term.  When it’s a BusinessWeek cover story (2008, btw), it gets seen by people.  What we need to understand is if this “new” visibility is the result of either a change in the threat landscape or a change in the marketing landscape.

IS APT A SHIFT IN FREQUENCY, A SHIFT IN CAPABILITY, OR A SHIFT IN BOTH FREQUENCY AND CAPABILITY?

If it is a change in the threat landscape, we need to understand what aspect of the landscape is changing.  The shift could be said to be one of a few scenarios:

1.)  More attacks on the same targets by the same actors. That is, are the government, the defense industrial base, or other targets attractive to certain nation-states experiencing a new volume of threat events?

2.) More attacks on new targets by the same actors. That is, are the nation-state actors finding new targets?  If so, are their targets of choice changing from organizations that are antagonistic to the policy desires of the sponsor state (certainly the Mandiant report reads like the Chinese are after anyone who threatens their political stability), to other targets – like retailers or hospitals (has, as Mandiant says, the APT become *everyone’s* problem)?

3.)  More attacks on the same targets by new actors. That is, it’s not just the usual suspects.  If *this* is the case, then we’re seeing a fundamental shift in the capabilities of threats.  That is, bad guys who used to be dumb just got a lot smarter thanks to the dissemination of skills/resources (sharing of technique, new access to advanced toolsets, etc) and they are going after all those people who were worrying about the APT in 2003.

4.)  More attacks on new targets by new actors. That is, the bad guys who used to be dumb just got a lot smarter and are now trying to use their new smarts against victims who heretofore had not had to worry about the APT.

Finally, the other option is that there is no shift in frequency or capability, but there is a shift in marketing budgets.  I tried to run a Google Trends search on “Advanced Persistent Threat” but got:

Your terms – “Advanced Persistent Threat” – do not have enough search volume to show graphs.

And “APT” trend search was clouded by other things that shared the same TLA.

WHAT DO YOU THINK?

I’m not sure what we’re seeing.  I was personally disappointed by the Mandiant report’s lack of demographics and frequency information.  I’m ready to believe that we’re seeing a fundamental shift in distributions concerning the threat agents, but there wasn’t anything in the report to support that notion.  I will leave you with a couple of items from the Verizon Report, though, and I’ll let you draw your own conclusions, given that the Verizon data set isn’t heavy on what we might call the Defense Industrial Base – those folks already live and breathe this stuff  – and this data is from 2008.

SOURCE OF ATTACKING IP

TARGETED VS. OPPORTUNISTIC ATTACKS

TREND IN USE OF CUSTOMIZED MALWARE

TIME TO DISCOVERY