Chapter 5. Anomaly Detection in Network Traffic with K-means Clustering

Sean Owen

There are known knowns; there are things that we know that we know. We also know there are known unknowns; that is to say, we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know.

Donald Rumsfeld

Classification and regression are powerful, well-studied techniques in machine learning. Chapter 4 demonstrated using a classifier as a predictor of unknown values. But there was a catch: in order to predict unknown values for new data, we had to know the target values for many previously seen examples. Classifiers can only help if we, the data scientists, know what we are looking for and can provide plenty of examples where input produced a known output. These are collectively known as supervised learning techniques, because their learning process receives the correct output value for each example in the input.

However, sometimes the correct output is unknown for some or all examples. Consider the problem of dividing up an ecommerce site's customers by their shopping habits and tastes. The input features are their purchases, clicks, demographic information, and more. The output should be groupings of customers: perhaps one group will represent fashion-conscious buyers, another will turn out to correspond to price-sensitive bargain hunters, and so on.

If you were asked to determine this target label for each new customer, you would ...


Advanced Analytics with Spark, 2nd Edition by Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills
