Distributed deep learning on Spark
Alexander Ulanov offers an overview of tools and frameworks that have been proposed for performing deep learning on Spark.
Evan Sparks describes the principles behind KeystoneML and introduces its programming model by way of example pipelines in NLP and image classification.
How Spark will fit into—and change—the current ecosystem of distributed computing tools.
Using Python and other tools for natural language processing, sentiment analysis, and data wrangling.
Crunching CERN’s colossal data with scalable analytics
Learn the basics of machine learning and deep learning using TensorFlow.
Using Apache Beam to become data-driven, even before you have big data.
A single, multitenant platform built with open source technologies, based on an understanding of basic common needs.
Kappa architecture and Bayesian models yield quick, accurate analytics in cloud monitoring systems.
Radu Gheorghe demonstrates how to create, retrieve, update, and delete documents in Elasticsearch. He also covers special Elasticsearch fields, like _type, _source, and _version, and the relationship between Elasticsearch shards and Lucene indices.
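The create/retrieve/update/delete cycle the blurb describes maps directly onto Elasticsearch's REST API. As a minimal sketch, the helpers below build the four request shapes against a hypothetical local cluster (the names, the `localhost:9200` URL, and the older type-in-URL path style that matches the `_type` field mentioned above are illustrative assumptions, not code from the talk):

```python
import json

ES = "http://localhost:9200"  # hypothetical local cluster

def index_doc(index, doc_type, doc_id, body):
    # Create or replace a document: PUT /{index}/{type}/{id}
    return ("PUT", f"{ES}/{index}/{doc_type}/{doc_id}", json.dumps(body))

def get_doc(index, doc_type, doc_id):
    # Retrieve: GET /{index}/{type}/{id}
    # The response envelope carries _type, _source, and _version fields.
    return ("GET", f"{ES}/{index}/{doc_type}/{doc_id}", None)

def update_doc(index, doc_type, doc_id, partial):
    # Partial update: POST /{index}/{type}/{id}/_update, body wrapped in "doc"
    return ("POST", f"{ES}/{index}/{doc_type}/{doc_id}/_update",
            json.dumps({"doc": partial}))

def delete_doc(index, doc_type, doc_id):
    # Delete: DELETE /{index}/{type}/{id}
    return ("DELETE", f"{ES}/{index}/{doc_type}/{doc_id}", None)

method, url, body = index_doc("library", "book", "1",
                              {"title": "Lucene in Action"})
```

In practice each tuple would be sent with an HTTP client; updating a document increments its `_version`, which is how Elasticsearch detects concurrent-modification conflicts.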
Bill Loconzolo reveals the lessons learned from building the Intuit Analytics Cloud.
Michael Armbrust and Tathagata Das explain what's new in Spark 2.0, demonstrating how stream processing is now more accessible via the Spark SQL and DataFrame APIs.
Natalino Busa presents the Coral system, a solution for streaming anomaly detection.
Alex Robbins takes an in-depth look at the Python API for Apache Spark. In this segment, he explores RDDs, the central abstraction in Spark and essential knowledge for anyone working in the system.
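The key idea behind RDDs is that transformations (`map`, `filter`) are lazy and only run when an action (`collect`, `reduce`) is called. The toy class below illustrates that semantics in plain Python; it is a deliberately simplified stand-in, not the real PySpark API, and the `MiniRDD` name is invented for this sketch:

```python
from functools import reduce

class MiniRDD:
    """Toy stand-in for an RDD: transformations are queued lazily,
    and nothing executes until an action is invoked."""

    def __init__(self, data, ops=()):
        self._data = data
        self._ops = ops  # queued transformations

    def map(self, f):
        # Lazy: returns a new MiniRDD, does not touch the data yet
        return MiniRDD(self._data, self._ops + (("map", f),))

    def filter(self, f):
        return MiniRDD(self._data, self._ops + (("filter", f),))

    def _materialize(self):
        out = list(self._data)
        for kind, f in self._ops:
            out = [f(x) for x in out] if kind == "map" \
                else [x for x in out if f(x)]
        return out

    def collect(self):   # action: runs the queued pipeline
        return self._materialize()

    def reduce(self, f):  # action
        return reduce(f, self._materialize())

rdd = MiniRDD(range(10))
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
evens_squared.collect()  # → [0, 4, 16, 36, 64]
```

In real PySpark the same pipeline would read almost identically (`sc.parallelize(range(10)).filter(...).map(...).collect()`), with the crucial difference that the data is partitioned across a cluster and each transformation runs in parallel on the partitions.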
Visualizations that show comparisons, connections, and conclusions offer analytical clarity.
The O’Reilly Podcast: Nikolaus Bates-Haus on tools and techniques for addressing data variety and augmentation at scale.
Jonathan Whitmore demonstrates how to install a pivot table extension and showcases its features by examining a dataset of restaurant scores.
Sean Owen and Yann Delacourt cover Spark's architecture, deployment strategies, and use cases, as well as Spark's impact on data science, analytics, and machine learning.
How QoS enables business-critical and low-priority applications to coexist in a single cluster.
With scikit-learn, you can deploy machine learning models in just a few lines of code. Andreas Mueller summarizes the classification, regression, and clustering algorithms in this powerful machine learning library.
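The "few lines of code" claim holds up in practice: scikit-learn's estimator API is load data, split, `fit`, `score`. A minimal classification sketch (the dataset and model choice here are illustrative, not taken from the talk):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small built-in dataset and hold out a test split
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Train a classifier and evaluate on the held-out data
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
score = clf.score(X_te, y_te)  # accuracy on the test split
```

Swapping in a regression or clustering algorithm changes only the estimator line; the `fit`/`predict`/`score` interface stays the same, which is what makes the library easy to summarize across all three families of algorithms.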
Pete Warden walks through popular open source tools from the academic world and shows you, step by step, how to process images with them.
Apache Hadoop co-founders Doug Cutting and Mike Cafarella explore the future of Hadoop.
Eric Frenkiel explains how a trinity of real-time technologies—Kafka, Spark, MemSQL—is enabling Uber and others to power their companies with predictive apps and analytics.
Companies are differentiating themselves by acting on data in real time. But what does “real time” really mean? Jack Norris discusses the challenges of coordinating data flows, analysis, and integration at scale to shape business as it happens.