Chapter 12. Avoiding Scraping Traps
There are few things more frustrating than scraping a site, viewing the output, and not seeing the data that’s so clearly visible in your browser. Or submitting a form that should be perfectly fine but gets denied by the web server. Or getting your IP address blocked by a site for unknown reasons.
These are some of the most difficult bugs to solve, not only because they can be so unexpected (a script that works just fine on one site might not work at all on another, seemingly identical, site), but because they purposefully lack any telltale error messages or stack traces to work from. You’ve been identified as a bot, rejected, and you don’t know why.
In this book, I’ve written about a lot of ways to do tricky things on websites (submitting forms, extracting and cleaning difficult data, executing JavaScript, etc.). This chapter is a bit of a catchall in that the techniques stem from a wide variety of subjects (HTTP headers, CSS, and HTML forms, to name a few). However, they all have something in common: they are meant to overcome an obstacle put in place for the sole purpose of preventing automated web scraping of a site.
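To give a flavor of what’s ahead, here is a minimal sketch of one such technique: sending browser-like HTTP headers so that a request is less likely to be flagged as automated. The header values and URL below are illustrative placeholders, not a definitive recipe.

import requests

# Headers mimicking a typical desktop browser. Without them, the requests
# library identifies itself as "python-requests/x.y.z", which many sites
# reject on sight.
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/120.0.0.0 Safari/537.36'),
    'Accept': 'text/html,application/xhtml+xml,'
              'application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
}

# https://www.example.com is a placeholder target.
response = requests.get('https://www.example.com', headers=headers)
print(response.status_code)

This is only one small piece of the puzzle; headers alone won’t save a scraper that behaves nothing like a human visitor, which is exactly the sort of problem the rest of this chapter addresses.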
Regardless of how immediately useful this information is to you, I highly recommend that you at least skim this chapter. You never know when it might help you solve a very difficult bug or prevent a problem altogether.
A Note on Ethics
In the first few chapters of this book, I discussed the legal gray area that web ...