Databricks Machine Learning Associate Certification Prep
Published by O'Reilly Media, Inc.
Get ready to ace the Databricks Machine Learning Associate Certification exam!
Course outcomes:
- Prepare for the Databricks Certified Machine Learning Associate exam
- Understand how to use Databricks Lakehouse Platform and its tools
- Learn how to set up ML clusters and run end-to-end ML workflows
- Discover how to apply advanced ML concepts with MLflow, Spark ML, and pandas
- Understand how to put ML models into production using industry best practices
Course description:
The Databricks Machine Learning Associate certification shows that you understand the Databricks Lakehouse Platform and its tools and that you can use them to perform machine learning tasks. Developed by industry experts, it is widely recognized as a benchmark for machine learning professionals.
Join expert Dr. Yasir Khan to build a strong foundation in all the topics covered in the certification exam through hands-on experience. By the end of this course, you'll know how to use Databricks Machine Learning, including AutoML, Feature Store, and selected MLflow capabilities. You'll be able to make sound decisions in machine learning workflows, implement those workflows efficiently with Spark ML, and assess the advanced scaling characteristics of machine learning models.
NOTE: With today’s registration, you’ll be signed up for all four sessions. Although you can attend any of the sessions individually, we recommend participating in all four.
What you’ll learn and how you can apply it
- Use Databricks Lakehouse Platform and its tools
- Understand Databricks ML capabilities like AutoML, Feature Store, and MLflow
- Use ML workflows to carry out exploratory data analysis, feature engineering, and model training, evaluation, and selection
- Learn advanced ML concepts such as distributed ML, modeling APIs, Hyperopt, Spark ML, pandas API, and UDFs
- Learn how to scale ML models for production
This live event is for you because...
- You're a data scientist who wants to apply your skills to Databricks.
- You want to achieve a Databricks Machine Learning Associate certification.
- You're new to Databricks and want to specialize in data science using Databricks.
Prerequisites
- Have created a cloud account on any of the following: Databricks on AWS, Databricks on Google Cloud, Databricks on Microsoft Azure
- Familiarity with SQL and relational databases
- A basic understanding of data science concepts and Python
Recommended preparation:
- Bookmark the course Bitbucket repository (instructions for cloning the repo in your Databricks workspace will be given in the course)
Recommended follow-up:
- Watch Build an End-to-End Machine Learning Pipeline (video)
- Read Business Intelligence with Databricks SQL (book)
- Read Azure Databricks Cookbook (book)
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
Day 1: Databricks Lakehouse Architecture and Machine Learning Concepts
Introduction (70 minutes)
- Presentation: Understanding Databricks lakehouse and architecture
- Q&A
Databricks UI (35 minutes)
- Presentation and demonstration: Databricks workspace with Community Edition and with Azure; UI overview
- Q&A
- Break
Databricks machine learning key concepts (45 minutes)
- Presentation: Introduction to Databricks ML; ML cluster architecture; connection to/from Databricks repo; ML workflow orchestration
- Q&A
Databricks runtime for machine learning (40 minutes)
- Presentation: Creating Databricks ML cluster
- Hands-on exercise: Explore cluster features from UI
- Q&A
- Break
Classification, regression, and forecasting with AutoML (50 minutes)
- Presentation: Data exploration; ML workflow and evaluation metrics using AutoML
- Q&A
Day 2: Advanced ML with Databricks
Introduction to feature store (50 minutes)
- Presentation and hands-on exercise: Advantages of using a feature store; creating a feature store table and writing data to it; training and scoring a model using features from a feature store table
- Q&A
Managed MLflow (50 minutes)
- Presentation: MLflow client API, metrics, artifacts, and models in an MLflow run
- Demonstration: Creating a nested run, locating execution time and code in a run
- Hands-on exercise: Explore ML end-to-end example
- Q&A
- Break
MLflow model registry (30 minutes)
- Presentation: Registering and transitioning a model using MLflow
- Q&A
Exploratory data analysis (50 minutes)
- Presentation: Data exploration; visualization; pandas profiling
- Q&A
- Break
Feature engineering (60 minutes)
- Presentation and hands-on exercise: Feature creation; scaling; selection and transformation; missing value imputation; outlier removal; one-hot encoding; dimensionality reduction
- Q&A
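As a rough illustration of three of the feature engineering topics listed above (missing value imputation, scaling, and one-hot encoding), here is a minimal sketch in plain Python. It is not taken from the course materials, which would more likely use pandas or Spark ML; the function names and the sample data are hypothetical and chosen only for illustration.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(labels):
    """Encode each label as a 0/1 vector over the sorted distinct labels."""
    categories = sorted(set(labels))
    vectors = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, vectors


# Hypothetical sample data: one numeric column with a gap, one categorical column.
ages = [20, None, 40, 30]
cities = ["Paris", "New York", "Paris", "Bangalore"]

ages_filled = impute_mean(ages)            # the gap becomes the mean of 20, 40, 30
ages_scaled = min_max_scale(ages_filled)   # smallest value maps to 0.0, largest to 1.0
categories, city_vectors = one_hot(cities)
```

In a real workflow the imputation value and the scaling bounds would be computed on the training split only and then reused on validation and test data, so that no information leaks across splits.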
Day 3: ML Workflows
Hyperparameter tuning with Hyperopt (60 minutes)
- Presentation and hands-on exercise: Introduction; hyperparameter parallelization; SparkTrials; tuning distributed training algorithms; model accuracy
- Q&A
- Break
Evaluation and selection (60 minutes)
- Presentation and hands-on exercise: Automated MLflow tracking; model fitting; cross-validation; evaluation metrics; model selection
- Q&A
- Break
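To make the cross-validation and model selection topics above concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the course's method: the helper names (`k_fold_indices`, `cross_val_error`), the two toy baseline "models", and the data are all hypothetical, and the folds are contiguous with no shuffling. Libraries such as Spark ML provide their own cross-validation utilities.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        stop = start + base + (1 if fold < extra else 0)
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


def cross_val_error(ys, k, fit):
    """Average the per-fold mean absolute error of a constant predictor."""
    fold_errors = []
    for train, validation in k_fold_indices(len(ys), k):
        prediction = fit([ys[i] for i in train])
        fold_errors.append(mean(abs(prediction - ys[i]) for i in validation))
    return mean(fold_errors)


# Two hypothetical candidates: predict the training mean, or always predict zero.
candidates = {
    "mean": lambda train_ys: mean(train_ys),
    "zero": lambda train_ys: 0.0,
}

ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
scores = {name: cross_val_error(ys, 3, fit) for name, fit in candidates.items()}

# Model selection: keep the candidate with the lowest cross-validated error.
best = min(scores, key=scores.get)
```

Because every sample is used for validation exactly once, the averaged error is a less optimistic estimate than an error computed on the training data itself, which is why it is the quantity used to choose between candidates.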
Spark ML modeling APIs—classification, regression and decision trees (120 minutes)
- Presentation and demonstration: Data ingestion; split; training; evaluation; estimation; transformation pipeline using Spark ML APIs
- Q&A
Day 4: ML Scalability and Using Pandas in Databricks
Scaling ML models (40 minutes)
- Presentation: Spark scaling of linear regression; decision trees; ensemble learning; bagging, boosting, and stacking
- Q&A
Pandas on Databricks (60 minutes)
- Presentation and demonstration: Introduction; store and load data with pandas; files on Databricks; accessing data; mounting to DBFS
- Q&A
- Break
Pandas API on Spark (70 minutes)
- Presentation and demonstration: Object creation (series, DataFrame, view data, selection); grouping and plotting data; SQL in pandas API on Spark; conversion to/from PySpark DataFrame; caching
- Q&A
- Break
Pandas function APIs (20 minutes)
- Presentation: Introduction; pandas function API map; grouped map; cogrouped map
- Q&A
Pandas user-defined functions (30 minutes)
- Presentation: Introduction; Series UDF; iterator of Series UDF; iterator of multiple Series UDF; scalar UDF
- Q&A
Wrap-up (20 minutes)
- Presentation: Certification exam format, best practices, guidelines, and useful links
- Q&A
Your Instructor
Yasir Khan
Dr. Yasir Khan is the founder of 38 Labs, an Enterprise Data & AI consulting group with offices in Paris, New York, and Bangalore. He holds a PhD in AI and is an instructor at O’Reilly Media, mentoring future experts on AI transformation, machine learning, enterprise solutions, and digital transformation. Over his career he has published several articles for leading publishing houses in the field of AI. He speaks at international conferences such as PyCon, PyData, and IEEE. In his spare time he likes flying aircraft, climbing mountains, and traveling.