Advanced Web Scraping
Published by O'Reilly Media, Inc.
Scraping data from a website like Wikipedia or sports-reference.com is pretty easy. Everything is rendered with vanilla HTML/CSS, and the tag elements are predictable and well labeled.
But what if the data you need to scrape isn’t tagged properly? Or it’s locked behind a login page, requires clicking and scrolling to get at, or is rendered with JavaScript? What then? In the past, you might have given up and moved on. No more!
In this live training, Max will help you take your web scraping skills to the next level so that you will be better equipped for the next pesky page that you have to scrape!
What you’ll learn and how you can apply it
By the end of this live, hands-on, online course, you’ll understand:
- Why some websites are harder to scrape than others
- How to scrape data that is rendered in-browser with JavaScript
- How to automate some browser tasks (like clicking and scrolling)
And you’ll be able to:
- Schedule scraping jobs on a server
- Set up notification and email triggers based on certain events
This live event is for you because...
- You already have some web scraping experience, for example from taking Web Scraping in 60 Minutes (live online training course with Max Humber)
- You want to scrape more difficult websites for personal and professional projects
- You want to learn about the latest and greatest scraping tools
Prerequisites
- Required: Experience with Python, and familiarity with BeautifulSoup
- Optional: Take Web Scraping in 60 Minutes (live online training course with Max Humber)
Recommended preparation:
- Download and install Selenium
Recommended follow-up:
- Read Web Scraping with Python, 2nd Edition (book)
- Read Learn Selenium (book)
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
Introduction (5 minutes)
- Who am I, and who are you?
- Poll:
- Learning Agenda
Basics (5 minutes)
- A quick review of how to fetch and parse HTML
- How to target HTML elements by tag and attribute
- Exercise: Scrape a “simple” website
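As a rough illustration of the targeting step reviewed here, the sketch below pulls every link's `href` attribute and text out of a small HTML snippet. It is not course material: the course lists BeautifulSoup as a prerequisite, while this sketch swaps in Python's standard-library `html.parser` so that it runs without extra installs. The sample markup, the `LinkCollector` class, and the `extract_links` helper are all made up for the example.

```python
from html.parser import HTMLParser

# Made-up markup standing in for a fetched page (hypothetical data).
SAMPLE_HTML = """
<table id="standings">
  <tr class="team"><td><a href="/teams/a">Team A</a></td><td>10</td></tr>
  <tr class="team"><td><a href="/teams/b">Team B</a></td><td>7</td></tr>
</table>
"""


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


links = extract_links(SAMPLE_HTML)
```

With BeautifulSoup, roughly the same result would come from iterating over `soup.find_all("a")` and reading each tag's `href` and text.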
Pesky Pages (15 minutes)
- How to scrape data locked behind a login page
- How to scrape data rendered with JavaScript
- Exercise: Scrape a website with login credentials
- Q&A (5 minutes)
Scheduling (10 minutes)
- How to put a scraper on a schedule
- How to send emails with scraping results
- Exercise: Schedule a scraper
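One possible shape for the scheduling and email topics is sketched below, using only the Python standard library. It is an assumption about how these pieces could fit together, not the approach taught in the session: `scrape` is a hypothetical stand-in that returns fake rows, the addresses are placeholders, and on a real server a cron entry or similar would usually replace the in-process `sched` loop.

```python
import sched
import time
from email.message import EmailMessage


def scrape():
    """Hypothetical stand-in for a real scraper; returns fake rows."""
    return [{"item": "widget", "price": 9.99}]


def build_report(rows, sender, recipient):
    """Format scraped rows as a plain-text email message (not sent here)."""
    msg = EmailMessage()
    msg["Subject"] = f"Scrape results: {len(rows)} row(s)"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(f"{r['item']}: {r['price']}" for r in rows))
    return msg


def run_on_schedule(job, interval_seconds, runs):
    """Run `job` a fixed number of times, spaced `interval_seconds` apart."""
    scheduler = sched.scheduler(time.monotonic, time.sleep)
    results = []
    for i in range(runs):
        # Each run is queued up front with an increasing delay.
        scheduler.enter(i * interval_seconds, 1, lambda: results.append(job()))
    scheduler.run()  # blocks until every queued run has finished
    return results
```

Calling `run_on_schedule(scrape, interval_seconds=60, runs=3)` would collect three batches one minute apart, and each batch could be passed to `build_report`. Actually sending the message is left out; `smtplib.SMTP(...).send_message(msg)` is the usual standard-library route and needs a real mail server and credentials.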
Browser Automation (15 minutes)
- Replicate scrolling and browser clicks to get at hard-to-parse data
- How to leverage Optical Character Recognition (OCR)
- How to scrape images and other multimedia types
- Exercise: Use OCR to parse text that is not stored as text (for example, text inside images)
Conclusion + Q&A (5 minutes)
Your Instructor
Max Humber
Max Humber helps individuals, startups, Fortune 500 companies, and (sometimes) government agencies solve problems with technology. He also independently publishes apps at bracket and teaches at General Assembly.