Chapter 7. Bridging Spark and Deep Learning Frameworks
So far, the main focus of this book has been on leveraging Spark’s capabilities for scaling machine learning workloads. Spark is often a natural choice for scalable analytics workloads as well, and in many organizations data scientists can take advantage of the teams that already support it. In this scenario, data scientists, data engineers, machine learning engineers, and analytics engineers are all consumers and/or creators of the data and share responsibility for the machine learning infrastructure. Using a scalable, general-purpose tool such as Apache Spark facilitates this collaborative work.
But while Spark is a powerful general-purpose engine with rich capabilities, it lacks some critical features needed to fully support scalable deep learning workflows. This is the natural curse of development frameworks: in the distributed world, every framework must make decisions at the infrastructure level that later limit the possibilities of its API and constrain its performance. Spark’s limitations mostly stem from its underlying premise that every algorithm implementation must be able to scale without limit, which requires the model to perform its learning process at scale, with each step or iteration distributed across multiple machines. This is in keeping with the Spark framework’s philosophy that the size of the cluster changes the duration of the job, not its capacity to run an algorithm. This implies that ...
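To ground this premise, here is a minimal PySpark sketch (the input path and column names are hypothetical) of the standard pyspark.ml estimator pattern. The point is that the same training code runs unchanged whether the cluster has one worker or hundreds, because each training iteration is expressed as distributed Spark tasks over the data’s partitions; cluster size changes how long the job takes, not whether the algorithm can run.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-scaling-sketch").getOrCreate()

# Assume a DataFrame with numeric feature columns and a binary "label" column
# (hypothetical path and schema, used only for illustration).
df = spark.read.parquet("path/to/training_data")

# Assemble the raw columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train_df = assembler.transform(df)

# Each iteration runs as distributed tasks over the partitions, with partial
# results aggregated back to the driver between iterations.
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
model = lr.fit(train_df)

Deep learning training loops, by contrast, do not always decompose neatly into this kind of per-partition aggregation, which is where the tension described above comes from.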