Hands-On Web Scraping with Python - Second Edition

Book description

Work through practical examples to unlock the full potential of web scraping with Python and gain valuable insights from high-quality data

Key Features

  • Build an initial portfolio of web scraping projects with detailed explanations
  • Grasp Python programming fundamentals related to web scraping and data extraction
  • Acquire skills to code web scrapers, store data in desired formats, and employ the data professionally
  • Purchase of the print or Kindle book includes a free PDF eBook

Book Description

Web scraping is a powerful tool for extracting data from the web, but it can be daunting for those without a technical background. Designed for novices, this book will help you grasp the fundamentals of web scraping and Python programming, even if you have no prior experience.

Adopting a practical, hands-on approach, this updated edition of Hands-On Web Scraping with Python uses real-world examples and exercises to explain key concepts. Starting with an introduction to web scraping fundamentals and Python programming, you’ll cover a range of scraping libraries, including requests, lxml, pyquery, Scrapy, and Beautiful Soup. You’ll also get to grips with advanced topics such as handling secure web content, web APIs, Selenium for web scraping, PDF extraction, regex, data analysis, exploratory data analysis (EDA) reports, visualization, and machine learning.

This book emphasizes the importance of learning by doing. Each chapter integrates examples that demonstrate practical techniques and related skills. By the end of this book, you’ll be equipped with the skills to extract data from websites, a solid understanding of web scraping and Python programming, and the confidence to use these skills in your projects for analysis, visualization, and information discovery.

What you will learn

  • Master web scraping techniques to extract data from real-world websites
  • Use popular web scraping libraries such as requests, lxml, Scrapy, and pyquery
  • Develop advanced skills in web scraping, APIs, PDF extraction, regex, and machine learning
  • Analyze and visualize data with pandas and Plotly
  • Develop a practical portfolio to demonstrate your web scraping skills
  • Understand best practices and ethical concerns in web scraping and data extraction

Who this book is for

This book is for beginners who want to learn web scraping and data extraction using Python. No prior programming knowledge is required, but a basic understanding of web-related concepts such as websites, browsers, and HTML is assumed. If you enjoy learning by doing and want to build a portfolio of web scraping projects and delve into data-related studies and applications, then this book is tailored to your needs.

Table of contents

  1. Hands-On Web Scraping with Python
  2. Contributors
  3. About the author
  4. About the reviewers
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Download the example code files
    5. Conventions used
    6. Get in touch
    7. Share Your Thoughts
    8. Download a free PDF copy of this book
  6. Part 1: Python and Web Scraping
  7. Chapter 1: Web Scraping Fundamentals
    1. Technical requirements
    2. What is web scraping?
    3. Understanding the latest web technologies
      1. HTTP
      2. HTML
      3. XML
      4. JavaScript
      5. CSS
    4. Data-finding techniques used in web pages
      1. HTML source page
      2. Developer tools
    5. Summary
    6. Further reading
  8. Chapter 2: Python Programming for Data and Web
    1. Technical requirements
    2. Why Python (for web scraping)?
    3. Accessing the WWW with Python
      1. Setting things up
      2. Creating a virtual environment
      3. Installing libraries
      4. Loading URLs
    4. URL handling and operations
      1. requests – Python library
    5. Implementing HTTP methods
      1. GET
      2. POST
    6. Summary
    7. Further reading
  9. Part 2: Beginning Web Scraping
  10. Chapter 3: Searching and Processing Web Documents
    1. Technical requirements
    2. Introducing XPath and CSS selectors to process markup documents
      1. The Document Object Model (DOM)
      2. XPath
      3. CSS selectors
    3. Using web browser DevTools to access web content
      1. HTML elements and DOM navigation
      2. XPath and CSS selectors using DevTools
    4. Scraping using lxml – a Python library
      1. lxml by example
      2. Web scraping using lxml
    5. Parsing robots.txt and sitemap.xml
      1. The robots.txt file
      2. Sitemaps
    6. Summary
    7. Further reading
  11. Chapter 4: Scraping Using PyQuery, a jQuery-Like Library for Python
    1. Technical requirements
    2. PyQuery overview
      1. Introducing jQuery
    3. Exploring PyQuery
      1. Installing PyQuery
      2. Loading a web URL
      3. Element traversing, attributes, and pseudo-classes
      4. Iterating using PyQuery
    4. Web scraping using PyQuery
      1. Example 1 – scraping book details
      2. Example 2 – sitemap to CSV
      3. Example 3 – scraping quotes with author details
    5. Summary
    6. Further reading
  12. Chapter 5: Scraping the Web with Scrapy and Beautiful Soup
    1. Technical requirements
    2. Web parsing using Python
      1. Introducing Beautiful Soup
      2. Installing Beautiful Soup
      3. Exploring Beautiful Soup
    3. Web scraping using Beautiful Soup
    4. Web scraping using Scrapy
      1. Setting up a project
      2. Creating an item
      3. Implementing the spider
      4. Exporting data
    5. Deploying a web crawler
    6. Summary
    7. Further reading
  13. Part 3: Advanced Scraping Concepts
  14. Chapter 6: Working with the Secure Web
    1. Technical requirements
    2. Exploring secure web content
      1. Form processing
      2. Cookies and sessions
      3. User authentication
    3. HTML <form> processing using Python
    4. User authentication and cookies
    5. Using proxies
    6. Summary
    7. Further reading
  15. Chapter 7: Data Extraction Using Web APIs
    1. Technical requirements
    2. Introduction to web APIs
      1. Types of API
      2. Benefits of web APIs
    3. Data formats and patterns in APIs
      1. Example 1 – sunrise and sunset
      2. Example 2 – GitHub emojis
      3. Example 3 – Open Library
    4. Web scraping using APIs
      1. Example 1 – holidays from the US calendar
      2. Example 2 – Open Library book details
      3. Example 3 – US cities and time zones
    5. Summary
    6. Further reading
  16. Chapter 8: Using Selenium to Scrape the Web
    1. Technical requirements
    2. Introduction to Selenium
      1. Advantages and disadvantages of Selenium
      2. Use cases of Selenium
      3. Components of Selenium
    3. Using Selenium WebDriver
      1. Setting things up
      2. Exploring Selenium
    4. Scraping using Selenium
      1. Example 1 – book information
      2. Example 2 – forms and searching
    5. Summary
    6. Further reading
  17. Chapter 9: Using Regular Expressions and PDFs
    1. Technical requirements
    2. Overview of regex
    3. Regex with Python
      1. re (search, match, and findall)
      2. re.split
      3. re.sub
      4. re.compile
      5. Regex flags
    4. Using regex to extract data
      1. Example 1 – Yamaha dealer information
      2. Example 2 – data from sitemap
      3. Example 3 – Godfrey’s dealer
    5. Data extraction from a PDF
      1. The PyPDF2 library
      2. Extraction using PyPDF2
    6. Summary
    7. Further reading
  18. Part 4: Advanced Data-Related Concepts
  19. Chapter 10: Data Mining, Analysis, and Visualization
    1. Technical requirements
    2. Introduction to data mining
      1. Predictive data mining
      2. Descriptive data mining
    3. Handling collected data
      1. Basic file handling
      2. JSON
      3. CSV
      4. SQLite
    4. Data analysis and visualization
      1. Exploratory Data Analysis using ydata_profiling
      2. pandas and plotly
    5. Summary
    6. Further reading
  20. Chapter 11: Machine Learning and Web Scraping
    1. Technical requirements
    2. Introduction to ML
      1. ML and Python programming
      2. Types of ML
    3. ML using scikit-learn
      1. Simple linear regression
      2. Multiple linear regression
      3. Sentiment analysis
    4. Summary
    5. Further reading
  21. Part 5: Conclusion
  22. Chapter 12: After Scraping – Next Steps and Data Analysis
    1. Technical requirements
    2. What happens after scraping?
    3. Web requests
      1. pycurl
      2. Proxies
    4. Data processing
      1. PySpark
      2. polars
    5. Jobs and careers
    6. Summary
    7. Further reading
  23. Index
    1. Why subscribe?
  24. Other Books You May Enjoy
    1. Packt is searching for authors like you
    2. Download a free PDF copy of this book

Product information

  • Title: Hands-On Web Scraping with Python - Second Edition
  • Author(s): Anish Chapagain
  • Release date: October 2023
  • Publisher(s): Packt Publishing
  • ISBN: 9781837636211