Distributed Computing with Spark SQL
- Introduction to Spark
- In this module, you will be able to discuss the core concepts of distributed computing and be able to recognize when and where to apply them. You'll be able to identify the basic data structure of Apache Spark™, known as a DataFrame. Additionally, you will use the collaborative Databricks workspace and write SQL code that executes against a cluster of machines.
- Spark Core Concepts
- In this module, you will be able to explain the core concepts of Spark. You will learn common ways to increase query performance by caching data and modifying Spark configurations. You will also use the Spark UI to analyze performance and identify bottlenecks, as well as optimize queries with Adaptive Query Execution.
- Engineering Data Pipelines
- In this module, you will be able to identify and discuss the general demands of data applications. You'll be able to access data in a variety of formats and compare and contrast the tradeoffs between these formats. You will explore and examine semi-structured JSON data (common in big data environments) as well as schemas and parallel data writes. You will be able to create an end-to-end pipeline that reads data, transforms it, and saves the result.
- Data Lakes, Warehouses and Lakehouses
- In this module, you will identify the key characteristics of data lakes, data warehouses, and lakehouses. Lakehouses combine the scalability and low-cost storage of data lakes with the speed and ACID transactional guarantees of data warehouses. You will build a production grade lakehouse by combining Spark with the open-source project, Delta Lake. Whoever said time travel isn't possible hasn't been to a lakehouse!