Advanced Web Scraping

Advanced

This live event utilizes Jupyter Notebook technology

Scraping data from a website like Wikipedia or sports-reference.com is pretty easy. Everything is rendered with vanilla HTML/CSS, and the tag elements are predictable and well labeled.

But what if the data you need to scrape isn’t tagged properly? Or it’s locked behind a login page, requires clicking and scrolling to get at, or is rendered with JavaScript? What then? Until now, you would most likely have given up and moved on. No more!

In this live training, Max will help you take your web scraping skills to the next level so that you will be better equipped for the next pesky page that you have to scrape!

What you’ll learn and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

Why some websites are harder to scrape than others
How to scrape data that is rendered in-browser with JavaScript
How to automate some browser tasks (like clicking and scrolling)

And you’ll be able to:

Schedule scraping jobs on a server
Set up notification and email triggers based on certain events

This live event is for you because...

You already have some web scraping experience, for example from taking Web Scraping in 60 Minutes (live online training course with Max Humber)
You want to scrape more difficult websites for personal and professional projects
You want to learn about the latest and greatest scraping tools

Prerequisites

Required: Experience with Python, and familiarity with BeautifulSoup
Optional: Take Web Scraping in 60 Minutes (live online training course with Max Humber)

Recommended preparation:

Download and install Selenium

Recommended follow-up:

Read Web Scraping with Python, 2nd Edition (book)
Read Learn Selenium (book)

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Introduction (5 minutes)

Who am I, and who are you?
Poll:
Poll:
Learning Agenda

Basics (5 minutes)

A quick review of how to fetch and parse HTML
How to target HTML element tags and attributes
Exercise: Scrape a “simple” website
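
As a rough illustration of what targeting tags and attributes can look like, here is a minimal sketch. It is not course material: it uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra installs, and the sample HTML, the class names, and the helper names (CellCollector, extract_cells) are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would fetch this HTML over HTTP.
SAMPLE_HTML = """
<table id="standings">
  <tr class="team"><td class="name">Alpha</td><td class="wins">12</td></tr>
  <tr class="team"><td class="name">Beta</td><td class="wins">7</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> whose class attribute matches."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.capturing = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "td" and dict(attrs).get("class") == self.target_class:
            self.capturing = True
            self.cells.append("")

    def handle_data(self, data):
        # Only keep text that sits inside a matching cell.
        if self.capturing:
            self.cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self.capturing = False


def extract_cells(html, target_class):
    """Return the stripped text of each matching <td>, in document order."""
    parser = CellCollector(target_class)
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


names = extract_cells(SAMPLE_HTML, "name")
wins = [int(w) for w in extract_cells(SAMPLE_HTML, "wins")]
```

With BeautifulSoup, the same idea is usually expressed by selecting elements by tag name and class; the sketch above only shows the underlying tag-and-attribute matching.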

Pesky Pages (15 minutes)

How to scrape data locked behind a login page
How to scrape data rendered with JavaScript
Exercise: Scrape a website with login credentials
Q&A (5 minutes)

Scheduling (10 minutes)

How to put a scraper on a schedule
How to send emails with scraping results
Exercise: Schedule a scraper
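
One possible shape for the scheduling and email steps is sketched below. It is an assumption about how such a job could be wired up, not the approach taught in the session: the helper names (next_run, build_report), the addresses, and the sample rows are hypothetical, and the message is only built, not sent.

```python
from datetime import datetime, timedelta
from email.message import EmailMessage


def next_run(now, hour, minute=0):
    """Return the next datetime at hour:minute strictly after `now`."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed, so run tomorrow instead.
        candidate += timedelta(days=1)
    return candidate


def build_report(rows, sender, recipient):
    """Build (but do not send) an email summarising scraped rows."""
    msg = EmailMessage()
    msg["Subject"] = f"Scrape results: {len(rows)} rows"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(f"{name}: {value}" for name, value in rows))
    return msg


# Hypothetical scraped rows and placeholder addresses.
report = build_report(
    [("Alpha", 12), ("Beta", 7)],
    sender="scraper@example.com",
    recipient="me@example.com",
)
```

To actually deliver the message, it could be passed to smtplib.SMTP(...).send_message(report) against a mail server you control. On a server, a system scheduler such as cron is a common alternative to computing run times in Python.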

Browser Automation (15 minutes)

How to replicate scrolling and browser clicks to get at hard-to-parse data
How to leverage Optical Character Recognition (OCR)
How to scrape images and other multimedia types
Exercise: Use OCR to parse text that is not stored as text

Conclusion + Q&A (5 minutes)

Your Instructor

Max Humber
Max Humber helps individuals, startups, Fortune 500 companies, and (sometimes) government agencies solve problems with technology. He also independently publishes apps at bracket and teaches at General Assembly.
