Chapter 4. Big Data Analytics with Spark SQL, DataFrames, and Datasets
According to the Spark Summit EU 2015 keynote by Matei Zaharia, the creator of Apache Spark (http://www.slideshare.net/databricks/spark-summit-eu-2015-matei-zaharia-keynote), Spark SQL and DataFrames are the most used components of the entire Spark ecosystem. This suggests that Spark SQL is one of the key components companies rely on for Big Data analytics.
Users of Spark have three different APIs to interact with distributed collections of data:
- The RDD API allows users to work with objects of their choice and express transformations as lambda functions
- The DataFrame API provides high-level relational operations and an optimized runtime, at the expense of type safety
- Dataset API that combines the worlds ...
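To make the contrast between the three APIs concrete, here is a minimal sketch that runs the same filter through each of them. It assumes Spark 2.x or later (for `SparkSession`) running in local mode; the `Person` case class, the sample rows, the age threshold of 21, and the `ThreeApis` object are hypothetical names chosen for illustration and do not come from the book.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type used only for this illustration.
case class Person(name: String, age: Int)

object ThreeApis {
  // Returns the names of people aged 21 or over, computed three ways.
  def adultNames(spark: SparkSession): (Seq[String], Seq[String], Seq[String]) = {
    import spark.implicits._ // encoders, toDF/toDS, and the $"col" syntax

    val people = Seq(Person("Ann", 34), Person("Bob", 19))

    // RDD API: arbitrary objects, transformations written as lambdas.
    val rdd = spark.sparkContext.parallelize(people)
    val viaRdd = rdd.filter(p => p.age >= 21).map(_.name).collect().toSeq

    // DataFrame API: relational operations on named columns.
    // A misspelled column name here is only reported at runtime.
    val df = people.toDF()
    val viaDf = df.filter($"age" >= 21).select("name").as[String].collect().toSeq

    // Dataset API: typed objects again, so p.age is checked by the compiler.
    val ds = people.toDS()
    val viaDs = ds.filter(p => p.age >= 21).map(_.name).collect().toSeq

    (viaRdd, viaDf, viaDs)
  }

  def main(args: Array[String]): Unit = {
    // Local mode is an assumption made so the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .appName("three-apis-sketch")
      .master("local[*]")
      .getOrCreate()
    try println(adultNames(spark))
    finally spark.stop()
  }
}
```

All three variants select the same rows; what differs is how the query is expressed and when mistakes such as a wrong field name are caught.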