Chapter 9. Case Study Using Multiple Tools

In this chapter we’re going to discuss what to do if you need to use “other” tools for your particular data science pipeline. Python has a plethora of tools for handling a wide array of data formats. RStats has a large repository of advanced math functions. Scala is the default language of big data processing engines such as Apache Spark and Apache Flink. Legacy programs that would be costly to reproduce exist in any number of languages.

A very important benefit of Kubeflow is that users no longer need to choose which language is best for their entire pipeline but can instead use the best language for each job (as long as the language and code are containerizable).

We will demonstrate these concepts through a comprehensive example denoising CT scans. Low-dose CT scans allow clinicians to use the scans as a diagnostic tool by delivering a fraction of the radiation dose—however, these scans often suffer from an increase in white noise. CT scans come in a format known as DICOM, and we’ll use a container with a specialized library called pydicom to load and process the data into a numpy matrix.

Several methods for denoising CT scans exist; however, they often focus on the mathematical justification, not the implementation. We will present an open source method that uses a singular value decomposition (SVD) to break the image into components, the “least important” of which are often the noise. We use Apache Spark with the Apache Mahout library ...

Get Kubeflow for Machine Learning now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.