Apache Spark has changed dramatically in the past year – from new APIs in Spark 1.4 to major execution improvements and even better APIs in 2.0. In this intermediate-level tutorial, I'll address the question of which Spark APIs to use through a series of brief technical explanations and demos that highlight best practices, the latest APIs, and new features.
We'll see how Dataset and DataFrame behave in Spark 2.0, examine Whole-Stage Code Generation, and walk through a simple example of Spark 2.0 Structured Streaming (streaming with DataFrames) that you can run in your own free instance of Databricks.
- Intro: What is "Modern Spark"?
- Why not use RDD?
- Intro to DataFrame and Dataset
- DataFrame versus Dataset
- Dataset Queries and Dataset with Scala classes
- Spark Query Optimizer
- Whole-Stage Codegen
- Hive integration
- Wrapping Up DataFrame/Dataset Benefits
- "One More Thing" - Structured Streaming
Spark Training from NewCircle
If you're just getting started with Spark development, check out our 3-day Spark Programming course page to see upcoming public classes or to request onsite training for your team.