Data Science Bootcamp with Python, Pandas, and Plotly

Beginner to intermediate

Wrangling, exploring, visualizing, and simple modeling

This live event uses Jupyter Notebook

In this course you’ll:

Use the pandas library to query and manipulate data
Recognize common problems with real-world data, such as missing values, incorrect formats, and unhelpful structure
Create production-ready, interactive plots using the Plotly library
Apply best practices in data visualization to facilitate meaningful comparisons and account for the data design
Use data visualizations to explore data, uncover relationships, and spot data issues

You have a dataset and want to discover new insights. Now what? You’ll start by wrangling the data—checking the data quality, looking for missing values, and fixing the structure. You’ll also explore the data to find patterns and relationships between data features. When it comes to communicating your findings, you’ll apply best practices from data visualization to create clear and useful plots.

Join experts Sam Lau, Joey Gonzalez, and Deb Nolan to get up to speed on these essential steps in data analysis using Python. You’ll learn the industry-standard pandas library for data wrangling and the popular Plotly library for creating interactive, polished plots. You’ll write and debug code through a series of carefully chosen case studies ranging from San Francisco food safety violations to air quality data. You’ll also learn the underlying principles: what data patterns mean and why some plots are better than others. After completing this course, you’ll not only be able to clean, explore, and visualize data with Python, but you’ll also be able to explain and justify your analyses to others.

Week 1: Pandas and Data Cleaning

Week 2: Exploratory Data Analysis

Week 3: Principles of Data Visualization

Week 4: Case Study and Modeling Teaser

NOTE: With today’s registration, you’ll be signed up for all four sessions. Although you can attend any of the sessions individually, we recommend participating in all four weeks.

What you’ll learn and how you can apply it

Understand not only methods (e.g., how to write code) but also higher-level principles (e.g., how to design an effective plot)

This live event is for you because...

You’re a developer, analyst, or data scientist working in Python.
You want to learn how to use the pandas library to work with data.
You want to be able to make high-quality plots using the latest technology.
You want to use plots to make decisions using data.

Prerequisites

Proficiency in Python

Recommended preparation:

Take Python Functions (live online course with Noah Gift)
Read Learning Data Science (book)

Recommended follow-up:

Take Python Programming for Data Analysis in 5 Weeks (live online course with Reuven Lerner)

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Week 1: Pandas and Data Cleaning

Introduction to course (20 minutes)

Presentation: Overview of course, technology, and what you’ll learn
Hands-on exercise: Set up JupyterLab for coursework
Q&A

Pandas fundamentals (80 minutes)

Presentation: Subsetting, aggregating, joining, transforming data using pandas (based on chapter 6 of Learning Data Science)
Hands-on exercise: Practice with pandas on baby names dataset
Q&A
Break
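
As a rough preview of the four pandas operations named above (subsetting, aggregating, joining, transforming), here is a minimal sketch. The rows, the column names, and the `categories` lookup table are invented for illustration and are not the course's baby names dataset.

```python
import pandas as pd

# Invented toy rows standing in for a baby names table (not real counts).
babies = pd.DataFrame({
    "name": ["Ava", "Liam", "Ava", "Liam", "Noah"],
    "year": [2019, 2019, 2020, 2020, 2020],
    "count": [100, 120, 90, 130, 110],
})

# Hypothetical lookup table, used only to illustrate a join.
categories = pd.DataFrame({
    "name": ["Ava", "Liam"],
    "category": ["classic", "modern"],
})

# Subsetting: keep only the rows from 2020.
babies_2020 = babies[babies["year"] == 2020]

# Aggregating: total count per name across all years.
totals = babies.groupby("name")["count"].sum()

# Joining: a left join keeps every row of babies, matched or not.
joined = babies.merge(categories, on="name", how="left")

# Transforming: add each row's share of its year's total count.
year_totals = babies.groupby("year")["count"].transform("sum")
babies["share"] = babies["count"] / year_totals
```

In this sketch the left join leaves a missing `category` for the one name absent from the lookup table, which is the kind of gap the data wrangling session goes on to address.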

Data wrangling (80 minutes)

Presentation: Quality checks; how to work with missing values; reshaping; transforming; timestamps (based on chapter 9)
Hands-on exercise: Explore case study on restaurant food safety violations
Q&A
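
The wrangling steps listed above might look something like the following sketch. The inspection records, the column names, and the use of -1 as a placeholder for a missing score are all assumptions made for illustration; the real case study data may be structured differently.

```python
import pandas as pd

# Invented inspection records; column names are assumptions for illustration.
inspections = pd.DataFrame({
    "business_id": [1, 1, 2, 3],
    "score": [92, -1, 85, None],
    "date": ["20160503", "20170110", "20160812", "20161201"],
})

# Quality check: count missing values per column before any cleaning.
missing_before = inspections.isna().sum()

# Assumed convention: -1 is a placeholder, so treat it as missing
# rather than as a real score.
inspections["score"] = inspections["score"].replace(-1, float("nan"))

# Timestamps: parse the compact YYYYMMDD strings into datetimes.
inspections["date"] = pd.to_datetime(inspections["date"], format="%Y%m%d")

# Transforming: derive the inspection year from the parsed date.
inspections["year"] = inspections["date"].dt.year
```

Counting missing values before and after recoding the placeholder shows how a quality check can change once hidden missing values are made explicit.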

Week 2: Exploratory Data Analysis

Basic data visualization using Plotly (90 minutes)

Presentation: What is EDA?; feature types; plotting using Plotly (based on 11.6 and 10.1–10.3)
Hands-on exercise: Visualize and explore American Kennel Club data
Q&A
Break

Fundamentals of EDA (90 minutes)

Presentation: Uncovering patterns in data visualizations; using facets and small multiples (based on 10.4–10.5)
Hands-on exercise: Explore case study on housing prices
Q&A

Week 3: Principles of Data Visualization

Scale, smoothing, and comparisons (90 minutes)

Presentation: Why we have guidelines for data visualizations; choosing scale to reveal structure; smoothing and aggregating for large datasets; facilitating meaningful comparisons (based on 11.1–11.3)
Hands-on exercise: Apply data visualization principles on datasets
Q&A
Break

Data design and context (90 minutes)

Presentation: Time series data; observational studies; geographic data; informative titles and annotations (based on 11.4–11.6)
Hands-on exercise: Explore case study on housing prices
Q&A

Week 4: Case Study and Modeling Teaser

Introduction to case study (15 minutes)

Presentation: Air quality sensors; government sensors versus PurpleAir; why this study matters (based on 12.0)
Q&A

Exploring and visualizing sensor data (120 minutes)

Presentation: Loading the AQS and PurpleAir data; data wrangling; merging the data; data visualization (based on 12.1–12.3)
Break
Hands-on exercise: Explore case study on data wrangling, exploring, and visualizing
Q&A
Break

Modeling the sensor data (45 minutes)

Presentation: Linear models; using models for calibration; the final model result (based on 12.4)
Hands-on exercise: Use scikit-learn to fit linear model for calibration
Q&A
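
The calibration exercise could be sketched roughly as follows. The paired readings are invented, and regressing the reference measurement directly on the low-cost sensor reading is one simple arrangement assumed here; the case study may set up the model differently.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented paired readings from a low-cost sensor and a reference
# instrument. The reference values are constructed to follow
# reference = 0.5 * sensor + 2 exactly, so the fit is easy to check.
sensor = np.array([4.0, 10.0, 16.0, 22.0, 30.0])
reference = 0.5 * sensor + 2.0

# scikit-learn expects a 2D feature array of shape (n_samples, n_features).
X = sensor.reshape(-1, 1)
model = LinearRegression().fit(X, reference)

slope = model.coef_[0]
intercept = model.intercept_

# Calibrate a new sensor reading by passing it through the fitted line.
calibrated = model.predict(np.array([[20.0]]))[0]
```

Because the toy data lie exactly on a line, the fitted slope and intercept recover the constants used to build them; real sensor data would be noisy and the fit only approximate.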

Your Instructors

Sam Lau
Sam Lau is a PhD candidate at UC San Diego and coauthor of Learning Data Science. He designs novel interfaces for learning and teaching data science, and his research has been published in top-tier conferences in human-computer interaction and end-user programming. Sam instructed and helped design flagship data science courses at UC Berkeley that have grown to serve thousands of students every year.

Joey Gonzalez
Joseph Gonzalez is an Associate Professor at UC Berkeley and a founding member of the Berkeley Sky and RISE labs, where he studies the design of next-generation cloud systems and systems for high-performance machine learning. The RISE Lab is an NSF Expedition center, and both the Sky and RISE Labs are backed by a consortium of leading international industrial sponsors. His research addresses problems in data systems, neural network design, compilers and distributed systems for large-scale machine learning, natural language processing, computer vision, robotics, autonomous driving, and graph analytics. Gonzalez co-led the development of the large upper-division data science class (Data100), which he also teaches. Outside of Berkeley, Gonzalez is co-founder and VP of product at Aqueduct Inc. Prior to joining Berkeley, Gonzalez co-founded Turi Inc (formerly GraphLab) based on his thesis work and created the GraphX project (now part of Apache Spark). Gonzalez’s innovative work has earned him significant recognition, including the Okawa Research Grant, the NSF Expedition Award, and the NSF Early CAREER Award.

Deborah Nolan
Deborah Nolan's work with CDSS goes back to the beginnings of the data science major at Berkeley, including co-developing the course Data 100, Principles and Techniques of Data Science. Most recently, Deb served as Associate Dean for Data Science Undergraduate Studies from January 2020 until her retirement in June 2021. Deb has previously served as the chair of the Department of Statistics (2017-19, 2003-06), associate dean of the Division of Mathematical and Physical Sciences (2006-13) and interim dean of that division (2009).

Deb is widely recognized for her innovation in pedagogy and for developing programs to encourage students from all backgrounds to learn statistics and data science. She has co-authored several texts: “Stat Labs” with Terry Speed; “Teaching Statistics” with Andrew Gelman; “Data Science in R” with Duncan Temple Lang; and “Communicating with Data” with Sara Stoudt. Deb co-developed and ran the Berkeley Summer Math Institute and Explorations in Statistics Research, both summer programs for undergraduates. She also designed Berkeley Unboxing Data Science, a summer program for high school students. Deb is also a founding co-director of CalTeach, a teacher-training program for STEM majors at Berkeley.

Deb is the recipient of Berkeley’s Distinguished Teaching Award, the Chancellor’s Award in Public Service, and the American Statistical Association Waller Distinguished Teaching Career Award. She held the Zaffaroni Family Chair in Undergraduate Education at Berkeley.
