Chapter 11. Data Science and R

Data science is a relatively new discipline that first came to the attention of many through an article by O’Reilly’s Mike Loukides. While definitions of data science abound, Loukides distills his detailed observation of, and participation in, the field into this definition:

A data application acquires its value from the data itself, and creates more data as a result. It’s not just an application with data; it’s a data product. Data science enables the creation of data products.

One of the main open source ecosystems for data science software is at Apache and includes Hadoop (which includes the HDFS distributed filesystem, Hadoop Map/Reduce, the Ozone object store, and the YARN scheduler), the Cassandra distributed database, and the Spark compute engine. Read the “Modules and Related Tools” section of the Hadoop page for a current list.

What’s interesting here is that a great deal of this infrastructure, which is taken for granted by data scientists, is written in Java and Scala (a JVM language). Much of the rest is written in Python, a language that complements Java.

Data science problems may involve a lot of setup, so we’ll give only one example from traditional data science, using the Spark framework. Spark is written in Scala, which compiles to JVM bytecode, so it can be used directly from Java code.

In the rest of the chapter I’ll focus on a language called R, which is widely used both in statistics and in data science (well, also in many other sciences; many of the graphs you see in refereed ...

