Course Content

ML with PySpark

Introduction 

  • Introduction to distributed computing
  • Overview of the big data environment

Spark environment

  • Spark Architecture 
  • Resilient Distributed Datasets (RDDs)
  • Spark DataFrame 
  • Spark installation
  • Spark configuration

Machine learning on Spark

  • Overview of machine learning
  • PySpark SQL
  • PySpark MLlib
  • Data pipeline 

Predictive analytics

  • Linear Regression with MLlib

Classification with MLlib

  • Logistic Regression Model
  • Decision Tree Classifier
  • Random Forest Classifier
  • Gradient-Boosted Tree Classifier

Clustering 

  • Clustering - use case
  • KMeans clustering with MLlib