Anomaly detection in cybersecurity

stoomart · Jul 27, 2017

This question is primarily directed to @bapowell, but I encourage others to please add any thoughts or suggestions.

Brian, I just saw your bio while reading the CMB primers, and thought you may have some ideas on cybersecurity data analytics.

Some background: I've been in cybersecurity since 2000, and have been using Splunk for anomaly detection and investigation for just over a year now. Instead of opting for Splunk's SIEM package, I've been developing our anomaly detection logic from scratch, which has evolved over time to include any combination of the following:

volume (count)
commonality (count distinct entities)
frequency (relative time comparison)
variance (entity or population z-score)
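
As a rough illustration of these four views, here is a minimal Python sketch. It is not the Splunk logic used in practice: the event layout `(hour_bucket, entity, target)`, the sample data, and all function names are assumptions made up for this example, and "commonality" is interpreted here as the number of distinct entities seen per target.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev

# Hypothetical events: (hour_bucket, entity, target). Not real log data.
events = [
    (0, "alice", "db1"), (0, "bob", "db1"), (0, "carol", "db1"),
    (1, "alice", "db1"), (1, "alice", "db2"), (1, "alice", "db3"),
    (1, "alice", "db4"), (1, "bob", "db1"), (1, "carol", "db1"),
]

def volume(events, bucket):
    """Volume: number of events per entity in one time bucket."""
    return Counter(entity for b, entity, _ in events if b == bucket)

def commonality(events, bucket):
    """Commonality: number of distinct entities that touched each target."""
    seen = defaultdict(set)
    for b, entity, target in events:
        if b == bucket:
            seen[target].add(entity)
    return {target: len(entities) for target, entities in seen.items()}

def frequency_ratio(events, current, previous):
    """Frequency: per-entity volume in one bucket relative to an earlier one."""
    now, before = volume(events, current), volume(events, previous)
    # Entities with no activity in the earlier bucket are skipped here.
    return {entity: now[entity] / before[entity] for entity in now if before[entity]}

def population_zscores(values):
    """Variance: z-score of each entity's value against the whole population."""
    vals = list(values.values())
    mu, sd = mean(vals), pstdev(vals)
    return {k: (v - mu) / sd if sd else 0.0 for k, v in values.items()}

print(volume(events, 1))
print(commonality(events, 1))
print(frequency_ratio(events, 1, 0))
print(population_zscores(volume(events, 1)))
```

In this toy data, alice stands out in bucket 1 on volume, frequency, and population z-score, while db2 through db4 stand out on commonality because only one entity touched them.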

Am I missing any ways of looking at the data?

Variance detection was the last major evolution in my efforts, and now I am looking for the next one. My research and testing in machine learning was a bit of a dud: I could only ever achieve ~80% accuracy rather than the high 90s I was hoping for, though this may have been a limitation of my abilities.

bapowell · Jul 28, 2017

Hi there. What kinds of events/activities are you analyzing? What is an example of an "entity"? My experience so far has been that the necessary data and interesting features are very much determined by the specific problem you're trying to tackle. I hesitate to make a generic list of metrics for this reason.

What kinds of problems have you tried to solve with machine learning? What's your background, if you don't mind my asking?

stoomart · Jul 28, 2017

bapowell said:

Hi there. What kinds of events/activities are you analyzing?

Logs from web servers, perimeter security devices (fw, waf, ips), internal netflow, server logons, database access/audit/alert, endpoint security, software/hardware installs, and others in line with the CIS top 20 controls.

What is an example of an "entity"?

This would be the actor in an event such as an internal user/machine, or external client.

My experience so far has been that the necessary data and interesting features are very much determined by the specific problem you're trying to tackle. I hesitate to make a generic list of metrics for this reason.

I agree; all my triggers are built around the individual data variables and the kind of anomaly I'm interested in. Sorry for the generic nature of this question. I'm hoping I've missed something obvious, but I have a sense that machine learning is the only way to really jump forward from this point.

What kinds of problems have you tried to solve with machine learning?

Most of my experience with machine learning was training DLP to identify proprietary source code files unique to the company running it; this product worked very well. My own efforts were focused on identifying anomalies in network behavior from netflow data using Splunk's machine learning engine.

What's your background, if you don't mind my asking?

I got started in security in high school with a major security vendor (big yellow), supported and administered every type of security product you can think of, got my CISSP somewhere in there, and am now the technical lead on a security team of 4 at an independent state agency.

bapowell · Jul 28, 2017

One project I'm working on currently is using a learning algorithm to detect data exfiltration. The data that we're feeding to the classifier are suitably transformed netflows; it's currently not clear which features we need to sufficiently (and minimally) characterize a given flow record, but I'm hoping to make it port/protocol agnostic and perhaps independent of actual amounts of traffic per connection. Preliminary results are promising, but a big part of the challenge is realistically modeling the exfiltration.

stoomart · Jul 29, 2017

The features I found most helpful in machine learning were connection count, upload bytes, and download bytes. My variance triggers calculate these three values for each entity (user, client) or object (port, webhost) in their target data set by time buckets (1h, 6h, 1d); the latest bucket for each entity/object is then compared to previous buckets to identify sigma spikes in any of the calculated fields.
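
The variance trigger described above can be sketched in plain Python. This is only an approximation under stated assumptions, not the actual Splunk search: the event layout `(timestamp_seconds, entity, upload_bytes, download_bytes)`, the function names, the 3-sigma threshold, and the minimum history length are all hypothetical choices for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

FIELDS = ("connections", "upload_bytes", "download_bytes")

def bucket_totals(events, bucket_seconds=3600):
    """Sum connection count, upload bytes, and download bytes
    per entity and per time bucket (1h by default)."""
    totals = defaultdict(lambda: defaultdict(lambda: [0, 0, 0]))
    for ts, entity, up, down in events:
        row = totals[entity][int(ts // bucket_seconds)]
        row[0] += 1      # connection count
        row[1] += up     # upload bytes
        row[2] += down   # download bytes
    return totals

def sigma_spikes(totals, threshold=3.0, min_history=3):
    """Compare each entity's latest bucket with its own earlier buckets
    and report fields whose z-score meets the (assumed) threshold."""
    spikes = []
    for entity, buckets in totals.items():
        ordered = sorted(buckets)
        latest, history = ordered[-1], ordered[:-1]
        if len(history) < min_history:
            continue  # too little history for a meaningful baseline
        for i, field in enumerate(FIELDS):
            past = [buckets[b][i] for b in history]
            mu, sd = mean(past), pstdev(past)
            if sd == 0:
                continue  # flat baseline: z-score is undefined, skip
            z = (buckets[latest][i] - mu) / sd
            if z >= threshold:
                spikes.append((entity, field, round(z, 2)))
    return spikes

# Hypothetical sample: five hourly buckets, one connection per hour per user.
uploads = {"alice": [100, 120, 80, 100, 5000], "bob": [100, 110, 90, 100, 105]}
events = [(hour * 3600, user, up, 1000)
          for user, series in uploads.items()
          for hour, up in enumerate(series)]
print(sigma_spikes(bucket_totals(events)))
```

Two simplifications are worth noting: buckets with no activity are absent rather than counted as zero, and a perfectly flat baseline is skipped instead of flagged, so a real trigger would likely need explicit handling for both cases.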
