Eridrus 11 days ago

I am skeptical of anomaly detection since in my experience anomalies are common and diverse and don't actually matter, so I expect these systems to basically inundate people with false positives.

Their offline training accuracy is garbage: 16% precision, so all of the real work is basically being done in the online training portion, which gets it to a respectable 82%+ precision.

But they don't tell you how many alerts they had to label to get those numbers. Maybe over the long run you get those numbers, but you really want to know if it takes 10 or 10,000 examples to get there.

Also, their dataset distribution is very different from reality: they have 7% of their dataset annotated as real anomalies; I don't think anyone in the real world wants anywhere near that share of their log entries to get flagged as anomalies. So I expect their precision numbers to be far worse on more realistically distributed logs.
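To make the base-rate point concrete, here is a rough back-of-the-envelope sketch. It takes the 82% precision and 7% anomaly share quoted above, assumes a recall of 0.9 (my assumption, not a figure from the paper), backs out the false-positive rate those numbers imply, and then recomputes precision at lower anomaly shares. The function names are made up for illustration.

```python
def implied_false_positive_rate(precision, recall, prevalence):
    """Back out the false-positive rate that yields `precision` at `prevalence`."""
    true_pos = recall * prevalence
    false_pos = true_pos * (1.0 - precision) / precision
    return false_pos / (1.0 - prevalence)


def precision_at(prevalence, recall, false_positive_rate):
    """Precision of the same detector when anomalies make up `prevalence` of the data."""
    true_pos = recall * prevalence
    false_pos = false_positive_rate * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


ASSUMED_RECALL = 0.9  # assumption for illustration, not taken from the paper
fpr = implied_false_positive_rate(precision=0.82, recall=ASSUMED_RECALL, prevalence=0.07)

for prevalence in (0.07, 0.01, 0.001):
    p = precision_at(prevalence, ASSUMED_RECALL, fpr)
    print(f"anomaly share {prevalence:.1%} -> precision {p:.1%}")
```

With recall and false-positive rate held fixed, precision can only fall as real anomalies get rarer, which is why the headline number says little about logs where anomalies are far below 7%.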

  • jstarfish 11 days ago

    It's a good time to make some money baffling people with bullshit in the cybersecurity space.

    Of course if you let an ML-powered "anomaly detection" engine run rampant on your logs, it's going to find anomalies...just like if you hire a ghost hunter, you'll be informed that your house is haunted. In the end, ghost chasing is all this anomaly nonsense turns out to be-- the justifications for conclusions by ML practitioners and ghost hunters alike tend to be equally mumbly and hand-wavy.

    Me working from home is technically an anomaly, and one these systems are all too eager to flag. We get random logins from overseas VPSes-- it's an anomaly! Oh, wait, no, we onboarded a client application. Oh, look, a random login from China for a US-based employee with no history of foreign logins! Yeah, that guy just started in a new position with travel requirements. Hey, this IP just tried to log into 5000 user accounts! Congratulations, you just alerted me to the existence of carrier NAT.

    None of this saves any time and usually wastes it, since it stirs up paranoia where none was otherwise warranted. It's a fun toy that gives the appearance of being productive when all it's actually doing is generating literally endless busywork. Good for justifying your SOC budget I suppose.

    But in the end nobody wants to pay a quarter-million dollars for a black box that just sits there quietly-- if it's not constantly drawing attention to itself and all the badness it's pretending to find, you're not going to have any reason to renew the license.

    "Renew it? Why? This thing didn't find anything at all last year."

    • Eridrus 11 days ago

      Oh I know, I spent a decade in Security and work in ML now, and I can see how badly people want to put the two together, but it's basically 90% bullshit and 10% same old shit of varying effectiveness.

    • noir-york 11 days ago

      So what does your organisation use for intrusion detection? Humans eyeballing logs doesn't scale. Rule-based approaches?

      • russh 11 days ago

        Mostly user complaints...

  • stephengillie 11 days ago

    At a previous tech support position, I collaborated with a data scientist to create a predictive alert system based on system notification data. It would monitor the quantity of noise from each interface on the network and alert on anomalies in Slack. The only issue was that it didn't work - we saw only false positive noise, and it sat quiet during actual incidents. It would be interesting to see another team's attempts, and what different design choices they make.

  • asavinov 11 days ago

    Another problem with anomaly detection systems is that they do not provide any (domain-specific) explanation for why the system thinks something is an anomaly. The system also does not say what to do in this situation, which means that such anomalies are not actionable findings. Therefore I think anomaly detection should be used as a pre-processing step which generates input for other components of the system.

  • dimitry12 11 days ago

    Are they doing supervised training for anomalies?

pilooch 11 days ago

I do, with others, a lot of ML anomaly detection in the cyber security context. DeepLog has interesting ideas, especially encoding the logs via an LSTM. The work was presented at a workshop at NIPS 2017.

One of the interesting facts we've been able to measure empirically over the past few years is that the magnitude of statistical anomaly scores, measured as reconstruction error, is uncorrelated with the criticality of the anomaly in terms of security / threat.

This means that in practice SOC operators need to label on top of the anomaly detection and a supervised model can do the reranking after a while.
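A toy sketch of what "label on top, then rerank" could look like. This is not our production setup, just the simplest supervised signal I can think of: a smoothed per-alert-type threat rate learned from operator labels, used as the primary sort key with the raw anomaly score as tie-breaker. The alert types, scores and the smoothing prior are all invented for illustration.

```python
from collections import Counter


def fit_threat_rates(labeled_alerts, prior_hits=1.0, prior_total=2.0):
    """Estimate P(threat | alert type) from operator labels, with additive smoothing."""
    hits, totals = Counter(), Counter()
    for alert_type, is_threat in labeled_alerts:
        totals[alert_type] += 1
        if is_threat:
            hits[alert_type] += 1

    def rate(alert_type):
        # Unseen types fall back to the prior (0.5 with the defaults).
        return (hits[alert_type] + prior_hits) / (totals[alert_type] + prior_total)

    return rate


def rerank(alerts, rate):
    """Order alerts by learned threat rate first, raw anomaly score second."""
    return sorted(alerts, key=lambda a: (rate(a["type"]), a["score"]), reverse=True)


# Hypothetical operator feedback: (alert type, was it a real threat?)
labels = (
    [("new_geo_login", False)] * 3
    + [("nat_fanout", False)] * 2
    + [("priv_escalation", True)] * 2
)
# Hypothetical fresh alerts with raw reconstruction-error scores.
alerts = [
    {"type": "new_geo_login", "score": 9.1},
    {"type": "nat_fanout", "score": 8.7},
    {"type": "priv_escalation", "score": 3.2},
    {"type": "never_seen_before", "score": 5.0},
]

ranked = rerank(alerts, fit_threat_rates(labels))
for alert in ranked:
    print(alert["type"], alert["score"])
```

The point of the example is only that the alert with the lowest raw score can end up on top once labels are taken into account. A real reranker would use richer features than the alert type.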

thaumaturgy 11 days ago

This is an interesting paper, but it sort of sidesteps one of the harder problems in generalized machine learning for log analysis:

> As shown by several prior work [9, 22, 39, 42, 45], an effective methodology is to extract a “log key” (also known as “message type”) from each log entry. The log key of a log entry e refers to the string constant k from the print statement in the source code which printed e during the execution of that code.

So if you're looking for a way to apply this to log data that varies wildly, like site access logs, you still have the difficult problem of converting the URIs to the numeric vectors needed by ML algorithms without losing the significant parts of the input.
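For illustration, a crude sketch of the templating step, assuming you do not have the source code's print statements to work from. This is not the paper's parser; it just masks any token containing a digit to approximate a log key, collapses digit-bearing URI path segments to a placeholder, and maps the resulting keys to integer ids. All function names here are made up.

```python
from urllib.parse import urlsplit


def log_key(message):
    """Approximate a log key by masking every whitespace token that contains a digit."""
    return " ".join(
        "<*>" if any(ch.isdigit() for ch in token) else token
        for token in message.split()
    )


def uri_template(uri):
    """Collapse path segments that contain digits; the query string is dropped."""
    path = urlsplit(uri).path
    return "/".join(
        "{id}" if segment and any(ch.isdigit() for ch in segment) else segment
        for segment in path.split("/")
    )


def encode(keys):
    """Map each distinct key to a small integer id, in order of first appearance."""
    vocab = {}
    ids = [vocab.setdefault(key, len(vocab)) for key in keys]
    return ids, vocab


print(log_key("Received block blk_123456 of size 67108864 from 10.0.0.5"))
print(uri_template("/users/4821/orders/77?sort=asc"))
print(encode(["/users/{id}", "/login", "/users/{id}"])[0])
```

Note that this illustrates the problem as much as it solves it: throwing away the query string and every digit-bearing segment is exactly the kind of lossy conversion that can discard the significant part of an access log line.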

asavinov 11 days ago

Here is another generic approach to anomaly detection from event data which has been used for analyzing logs received from automatic lawn mowers:

It allows for using different algorithms like one-class SVM or MDS (including custom algorithms). It also allows for defining custom domain-specific features as an integral part of its analysis engine. In particular, for log analysis, frequencies of various event types have been generated.
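As a rough idea of what "frequencies of event types" means as a feature, here is a minimal sketch that turns one window of events into a relative-frequency vector. The event names are invented, and this is not the API of the tool mentioned above.

```python
from collections import Counter


def frequency_vector(events, vocabulary):
    """Relative frequency of each known event type within one window of events."""
    counts = Counter(events)
    total = len(events) or 1  # avoid dividing by zero for an empty window
    return [counts[event_type] / total for event_type in vocabulary]


# Hypothetical event types and one hypothetical window of device events.
vocabulary = ["bump", "dock", "mow", "start"]
window = ["start", "mow", "mow", "bump", "mow", "dock"]

vector = frequency_vector(window, vocabulary)
print(dict(zip(vocabulary, vector)))
```

Vectors like this, one per device and time window, are what would then be handed to whichever detector is plugged in.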

lindig 11 days ago

The authors are using their own Spell[1] tool to parse syslog files into patterns that represent the fixed part of a printf-like log statement. Is the source of that available? At the heart of this is a tree-based construction that is not well explained.


ram_rar 11 days ago

Has anyone running real production systems benefited from anomaly detection on logs? I have usually converted some of the important events in logs to metrics and alerted users based on simple moving averages / spikes etc. I have usually started with alerts from system-level metrics and then checked the logs. Applying anomaly detection to logs directly hasn't worked for me yet.
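Roughly what I mean by the moving-average approach, as a minimal sketch: count an event per interval, keep a trailing window, and alert when the current count exceeds the window mean by some multiple of its standard deviation. The window size, threshold and floor on the deviation are arbitrary illustrative values.

```python
from collections import deque
from statistics import mean, pstdev


def spike_alerts(counts, window=5, threshold=3.0, min_std=1.0):
    """Return indices where a count exceeds the trailing mean by threshold * std."""
    history = deque(maxlen=window)
    alerts = []
    for index, count in enumerate(counts):
        if len(history) == window:
            baseline = mean(history)
            # Floor the deviation so a perfectly flat history does not alert on +1.
            spread = max(pstdev(history), min_std)
            if count > baseline + threshold * spread:
                alerts.append(index)
        history.append(count)
    return alerts


# Hypothetical per-minute counts of one "important" log event.
errors_per_minute = [4, 5, 3, 6, 4, 5, 40, 5, 4, 6]
print(spike_alerts(errors_per_minute))
```

One known weakness: the spike itself enters the trailing window and inflates the baseline for the next few intervals, so a second spike right after the first can be missed.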

  • gesman 11 days ago

    Oh yes.

    Applying K-Means clustering across different features of online traffic always shows some weird and often malicious stuff:

    • slv77 8 days ago

      Care to share more about what kinds of features you cluster on?

bhnmmhmd 11 days ago

I was wondering, has anyone here applied cluster analysis techniques for anomaly detection?

I read a paper that used it for insurance fraud detection, but I don't know which other fields use clustering to detect fraud and abnormalities.

I'd be grateful if someone can help.

  • gesman 11 days ago

    Yes, tons of that.

    See this - using K-Means clustering for anomaly detection in web traffic:

    Using DBSCAN clustering for anomaly detection in healthcare claims data (detecting doctors who anomalously prescribe opioids). Using the public CMS data set from 2015.

    4 out of the top 8 anomalies (doctors) were later actually convicted of crimes or got into all sorts of trouble with the DOJ:

    (Splunk Enterprise + free apps were used to ingest data and build all this logic and dashboards)
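For anyone who wants to see the clustering idea without Splunk, here is a small pure-Python DBSCAN sketch. It is a textbook version of the algorithm, not the logic from the dashboards above: points that end up in no dense cluster are labeled as noise and treated as the anomalies. The two features per point, `eps` and `min_pts` are invented values and assume the features are already on comparable scales.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or NOISE if it belongs to no dense region."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


# Hypothetical, pre-scaled feature pairs for nine entities.
points = [
    (1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.1), (5.2, 5.0),
    (9.0, 0.5),
]
labels = dbscan(points, eps=0.5, min_pts=3)
anomalies = [p for p, label in zip(points, labels) if label == NOISE]
print(labels, anomalies)
```

The results depend heavily on `eps` and `min_pts`, and a point being noise only says it is unusual, not that it is malicious.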

    • bhnmmhmd 11 days ago

      Thank you so much, it really was helpful.

cphoover 11 days ago

Is there a GitHub repo for DeepLog?

  • mino 11 days ago

    I had contacted the first author in March and the answer was that "our source code is currently not available because of a pending patent application".

sscarduzio 11 days ago

X-Pack has machine learning for log anomalies and people buy and use that stuff. Does anybody have direct experience with that?

  • dimitry12 11 days ago

    I don't but I was researching the space and has the most feature-rich product - though they only discover anomalies in numeric time-series.

    • ygur 10 days ago

      Check out for a spot on AI log analysis