Programming for Apache Spark 2.1
This 3-day training course teaches you how to harness Apache Spark 2.1 for large-scale data analysis, big data applications, and data processing pipelines. You will learn to program Spark efficiently and effectively by targeting the latest version of the platform (Spark 2.1) and learning the modern approach needed to take full advantage of it.
The entire course is taught hands-on, using real code and interactive examples. In addition, longer labs let attendees work together, applying their growing Spark knowledge to common challenges faced by organizations running complex Big Data applications in production.
Focus on Open Source Technology, Not Commercial Products
While we’re enthusiastic about many of the products in the Big Data ecosystem, the focus of this training course is to make you as proficient and effective as possible with open source Apache Spark, enabling you to apply the fundamental skills gained to whichever products and tools work best for you.
Learn to Program for Apache Spark 2.1 and the Future of the Spark Platform
By targeting the latest version of the Spark platform, Apache Spark 2.1, this course teaches you how to optimize your Spark code to take full advantage of the internal changes that make Spark 2.1 faster and more effective. It also prepares you for the future of the platform by teaching the modern approach to Spark programming that future releases will require.
- Program Apache Spark in a performant, modern, and straightforward way to perform ETL, analytics, machine learning, and streaming operations.
- Understand how Spark should – and shouldn't! – be used within your Big Data application architectures.
- Learn how Apache Spark processes your jobs so that you can troubleshoot, analyze, and improve performance if they don't run well.
- See important patterns, tricks, tips, and gotchas so that you don't have to learn them the hard way.
- Apache Spark Fundamentals and Background
- Compute Model
- APIs, Use Cases, and Ecosystem
- Differences from MapReduce
- Core Architecture
DataFrame/Dataset and SQL Analytics
- Overview of concepts and APIs
- Using SQL with Spark
- Data import/export, formats
- Parallelism and UI basics
- DataFrame operators, columns
- Hive integration (optional)
- Solving analytics problems with DataFrames
- DataFrame/Dataset vs. RDDs
Machine Learning Overview
- Understanding Apache Spark ML API Patterns
- Basics of Transformers, Estimators, and Pipelines
- Simple Linear Regression
- Apache Spark Structured Streaming with DataFrames
- Patterns and I/O Considerations
RDDs and Deep Dive Part 1
- RDD concept, partitioning, APIs
- Caching and persistence
- DAG and control flow
- Job execution: How does Spark use a cluster to run your jobs?
- Performance/Troubleshooting: Is my job running well? Improving execution
Catalyst/Tungsten and Deep Dive Part 2
- Apache Spark’s query optimizer
- Encoders and native memory
- How Apache Spark converts Dataset operations to RDD/DAG jobs
- Understanding the Jobs, Stages, and Tasks of DataFrame/SQL execution
- Performance Optimizations (and Gotchas) of Spark
- Broadcasts and Broadcast Joins
- Cluster manager options, pros/cons
- Patterns for submitting jobs to Apache Spark or exposing Apache Spark as a service
- Spark standalone clustering, YARN, and Mesos deployment
- Beyond standalone analytics or ETL: Integrating Spark services into your Architecture
Apache Spark Streaming in Depth
- Stream processing models: Receiver-based, Receiverless, Structured Streaming (2.x)
- APIs for Streaming Logic
- Streaming Integration Patterns
- Monitoring and Tuning for Streaming
- Reliability and Recovery for "Always-On" Apps
Machine Learning in Depth
- General explanation of predictive analytics / ML (optional)
- Spark ML feature coverage and integration with other ML toolkits/systems
- In-Depth ML Pipelines API examples
- Training models
- Evaluating models
- Tuning models (cross-validation, hyperparameter search)
- Deploying models to production