Web Scraping with Python, 3rd Edition

Book description

If programming is magic, then web scraping is surely a form of wizardry. By writing a simple automated program, you can query web servers, request data, and parse it to extract the information you need. This thoroughly updated third edition not only introduces you to web scraping but also serves as a comprehensive guide to scraping almost every type of data from the modern web.

Part I focuses on web scraping mechanics: using Python to request information from a web server, performing basic handling of the server's response, and interacting with sites in an automated fashion. Part II explores a variety of more specific tools and applications to fit any web scraping scenario you're likely to encounter.
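
As a rough illustration of the request-and-parse workflow described above, here is a minimal sketch using only the Python standard library. It is not taken from the book, which covers BeautifulSoup and Scrapy for this work. The names SAMPLE_HTML, LinkCollector, and extract_links are hypothetical, and the HTML string stands in for a server response so that the example runs without network access.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the body of a server response. A real scraper
# would fetch this text over HTTP instead of hard-coding it.
SAMPLE_HTML = """
<html><body>
  <h1>Example page</h1>
  <a href="/page1.html">First</a>
  <a href="/page2.html">Second</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Parse an HTML string and return its link targets in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


links = extract_links(SAMPLE_HTML)
print(links)
```

Swapping the hard-coded string for text fetched from a live server would turn this into a very small scraper; the parsing step stays the same.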

  • Parse complicated HTML pages
  • Develop crawlers with the Scrapy framework
  • Learn methods to store the data you scrape
  • Read and extract data from documents
  • Clean and normalize badly formatted data
  • Read and write natural languages
  • Crawl through forms and logins
  • Scrape JavaScript and crawl through APIs
  • Use and write image-to-text software
  • Avoid scraping traps and bot blockers
  • Use scrapers to test your website

Table of contents

  1. Preface
    1. What Is Web Scraping?
    2. Why Web Scraping?
    3. About This Book
    4. Conventions Used in This Book
    5. Using Code Examples
    6. O’Reilly Online Learning
    7. How to Contact Us
    8. Acknowledgments
  2. I. Building Scrapers
  3. 1. How the Internet Works
    1. Networking
      1. Physical Layer
      2. Data Link Layer
      3. Network Layer
      4. Transport Layer
      5. Session Layer
      6. Presentation Layer
      7. Application Layer
    2. HTML
    3. CSS
    4. JavaScript
    5. Watching Websites with Developer Tools
  4. 2. The Legalities and Ethics of Web Scraping
    1. Trademarks, Copyrights, Patents, Oh My!
      1. Copyright Law
    2. Trespass to Chattels
    3. The Computer Fraud and Abuse Act
    4. robots.txt and Terms of Service
    5. Three Web Scrapers
      1. eBay v. Bidder’s Edge and Trespass to Chattels
      2. United States v. Auernheimer and the Computer Fraud and Abuse Act
      3. Field v. Google: Copyright and robots.txt
  5. 3. Applications of Web Scraping
    1. Classifying Projects
    2. E-commerce
      1. Marketing
    3. Academic Research
    4. Product Building
    5. Travel
    6. Sales
    7. SERP Scraping
  6. 4. Writing Your First Web Scraper
    1. Installing and Using Jupyter
    2. Connecting
    3. An Introduction to BeautifulSoup
      1. Installing BeautifulSoup
      2. Running BeautifulSoup
      3. Connecting Reliably and Handling Exceptions
  7. 5. Advanced HTML Parsing
    1. Another Serving of BeautifulSoup
      1. find() and find_all() with BeautifulSoup
      2. Other BeautifulSoup Objects
      3. Navigating Trees
    2. Regular Expressions
    3. Regular Expressions and BeautifulSoup
    4. Accessing Attributes
    5. Lambda Expressions
    6. You Don’t Always Need a Hammer
  8. 6. Writing Web Crawlers
    1. Traversing a Single Domain
    2. Crawling an Entire Site
      1. Collecting Data Across an Entire Site
    3. Crawling Across the Internet
  9. 7. Web Crawling Models
    1. Planning and Defining Objects
    2. Dealing with Different Website Layouts
    3. Structuring Crawlers
      1. Crawling Sites Through Search
      2. Crawling Sites Through Links
      3. Crawling Multiple Page Types
    4. Thinking About Web Crawler Models
  10. 8. Scrapy
    1. Installing Scrapy
      1. Initializing a New Spider
    2. Writing a Simple Scraper
    3. Spidering with Rules
    4. Creating Items
    5. Outputting Items
    6. The Item Pipeline
    7. Logging with Scrapy
    8. More Resources
  11. 9. Storing Data
    1. Media Files
    2. Storing Data to CSV
    3. MySQL
      1. Installing MySQL
      2. Some Basic Commands
      3. Integrating with Python
      4. Database Techniques and Good Practice
      5. “Six Degrees” in MySQL
    4. Email
  12. II. Advanced Scraping
  13. 10. Reading Documents
    1. Document Encoding
    2. Text
      1. Text Encoding and the Global Internet
    3. CSV
      1. Reading CSV Files
    4. PDF
    5. Microsoft Word and .docx
  14. 11. Working with Dirty Data
    1. Cleaning Text
    2. Working with Normalized Text
    3. Cleaning Data with Pandas
      1. Cleaning
      2. Indexing, Sorting, and Filtering
      3. More About Pandas
  15. 12. Reading and Writing Natural Languages
    1. Summarizing Data
    2. Markov Models
      1. Six Degrees of Wikipedia: Conclusion
    3. Natural Language Toolkit
      1. Installation and Setup
      2. Statistical Analysis with NLTK
      3. Lexicographical Analysis with NLTK
    4. Additional Resources
  16. 13. Crawling Through Forms and Logins
    1. Python Requests Library
    2. Submitting a Basic Form
    3. Radio Buttons, Checkboxes, and Other Inputs
    4. Submitting Files and Images
    5. Handling Logins and Cookies
      1. HTTP Basic Access Authentication
    6. Other Form Problems
  17. 14. Scraping JavaScript
    1. A Brief Introduction to JavaScript
      1. Common JavaScript Libraries
    2. Ajax and Dynamic HTML
    3. Executing JavaScript in Python with Selenium
      1. Installing and Running Selenium
      2. Selenium Selectors
      3. Waiting to Load
      4. XPath
    4. Additional Selenium WebDrivers
    5. Handling Redirects
    6. A Final Note on JavaScript
  18. 15. Crawling Through APIs
    1. A Brief Introduction to APIs
      1. HTTP Methods and APIs
      2. More About API Responses
    2. Parsing JSON
    3. Undocumented APIs
      1. Finding Undocumented APIs
      2. Documenting Undocumented APIs
    4. Combining APIs with Other Data Sources
    5. More About APIs
  19. 16. Image Processing and Text Recognition
    1. Overview of Libraries
      1. Pillow
      2. Tesseract
      3. NumPy
    2. Processing Well-Formatted Text
      1. Adjusting Images Automatically
      2. Scraping Text from Images on Websites
    3. Reading CAPTCHAs and Training Tesseract
      1. Training Tesseract
    4. Retrieving CAPTCHAs and Submitting Solutions
  20. 17. Avoiding Scraping Traps
    1. A Note on Ethics
    2. Looking Like a Human
      1. Adjust Your Headers
      2. Handling Cookies with JavaScript
      3. TLS Fingerprinting
      4. Timing Is Everything
    3. Common Form Security Features
      1. Hidden Input Field Values
      2. Avoiding Honeypots
    4. The Human Checklist
  21. 18. Testing Your Website with Scrapers
    1. An Introduction to Testing
      1. What Are Unit Tests?
    2. Python unittest
      1. Testing Wikipedia
    3. Testing with Selenium
      1. Interacting with the Site
  22. 19. Web Scraping in Parallel
    1. Processes Versus Threads
    2. Multithreaded Crawling
      1. Race Conditions and Queues
      2. More Features of the Threading Module
    3. Multiple Processes
      1. Multiprocess Crawling
      2. Communicating Between Processes
    4. Multiprocess Crawling—Another Approach
  23. 20. Web Scraping Proxies
    1. Why Use Remote Servers?
      1. Avoiding IP Address Blocking
      2. Portability and Extensibility
    2. Tor
      1. PySocks
    3. Remote Hosting
      1. Running from a Website-Hosting Account
      2. Running from the Cloud
      3. Moving Forward
    4. Web Scraping Proxies
      1. ScrapingBee
      2. ScraperAPI
      3. Oxylabs
      4. Zyte
    5. Additional Resources
  24. Index
  25. About the Author

Product information

  • Title: Web Scraping with Python, 3rd Edition
  • Author(s): Ryan Mitchell
  • Release date: February 2024
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098145354