Big Data, Hadoop, and Spark Basics

By: edX

Module 1 – What is Big Data?

Introduction to Big Data

o What is Big Data?

o Impact of Big Data

o Parallel Processing, Scaling, and Data Parallelism

o Tools of Big Data

o Beyond the Hype

o Big Data Use Cases

o Viewpoints about Big Data

Module 2 – Introduction to the Hadoop Ecosystem

Introduction to the Hadoop Ecosystem

o What is Hadoop?

o An introduction to MapReduce

o The Hadoop Ecosystem and Common Components: Introducing HDFS, Hive, HBase, Spark, and other modules

o Working with HDFS

o Working with HBase

o Lab: MapReduce

Module 3 – Introduction to Apache Spark

Introduction to Apache Spark

o Why use Apache Spark?

o Functional Programming Basics

o Parallel Programming using Resilient Distributed Datasets

o Scale-out / Data Parallelism in Apache Spark

o DataFrames and SparkSQL

o Lab: Practical examples with PySpark

Module 4 – DataFrames and SparkSQL

DataFrames and SparkSQL

o Introduction to DataFrames and SparkSQL

o RDDs in Parallel Programming and Spark

o DataFrames and Datasets

o Catalyst and Tungsten

o ETL with DataFrames

o Lab: ETL with DataFrames

o Real-world usage of SparkSQL

o Lab: SparkSQL

Module 5 – Development and Runtime Environment Options

Development and Runtime Environment Options

o Apache Spark architecture

o Overview of Apache Spark Cluster Modes

o How to Run an Apache Spark Application

o Using Apache Spark on IBM Cloud

o Lab: Scale-out on IBM Spark Environment in Watson Studio

o Setting Apache Spark Configuration

o Running Spark on Kubernetes

o Lab: Spark on Kubernetes

Module 6 – Monitoring & Tuning

Monitoring and Tuning Apache Spark

o The Apache Spark User Interface

o Monitoring Jobs

o Debugging Parallel Jobs

o Understanding Memory Resources

o Understanding Processor Resources

o Lab: Monitoring and Performance Tuning

Module 7 – Final Quiz