Speeding up PySpark with DataFrames
The significance of DataFrames and the Catalyst Optimizer (and Project Tungsten) is the increase in performance of PySpark queries when compared to non-optimized RDD queries. As shown in the following figure, prior to the introduction of DataFrames, Python query speeds were often twice as slow as the same Scala queries using RDD. Typically, this slowdown in query performance was due to the communications overhead between Python and the JVM:
With DataFrames, not only was there a significant improvement in Python ...
Get Learning PySpark now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.