Time to Put Some Order - Cluster Your Data with Spark MLlib

"If you take a galaxy and try to make it bigger, it becomes a cluster of galaxies, not a galaxy. If you try to make it smaller than that, it seems to blow itself apart"

- Jeremiah P. Ostriker

In this chapter, we will delve deeper into machine learning and find out how to use it to group the records of an unlabeled dataset into clusters, that is, how to assign each observation to a group or class when no labels are available (unsupervised learning). In a nutshell, the following topics will be covered in this chapter:

  • Unsupervised learning
  • Clustering techniques
  • Hierarchical clustering (HC)
  • Centroid-based clustering (CC)
  • Distribution-based clustering (DC)
  • Determining the number of clusters
  • A comparative analysis between clustering algorithms ...

