This errata list records errors found after the product was released, together with their corrections.
The errata below were submitted by our customers and have not yet been confirmed or disproven by the author or editor. They solely represent the opinion of the customer.
Version | Location | Description | Submitted by | Date Submitted
PDF |
Chapter 1, Your first web scraper, pg 8, BeautifulSoup |
OK, the problem continues with an expired SSL certificate when attempting to use from urllib.request import urlopen.
Running Python 3.8 on Windows 10 (64-bit), using IDLE in a virtual environment created with python -m venv blahblah.
Everything steps through fine until attempting to open page1.html, which then displays:
URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1124)
Changing to the requests library, there are NO issues, so the problem has to be the lack of a valid SSL certificate for the target web page.
I will now switch to the requests library with the code from this 2nd edition book.
I will check whether the code changed from the 1st edition, which I doubt, but you never know.
I will print out the listed errata so I have a "heads up."
Thank you, Rey
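
For reference, a minimal sketch of one workaround that keeps urlopen, assuming the page1.html URL from the chapter: pass urlopen an SSL context built from certifi's CA bundle. This helps when the failure comes from a stale local certificate store; a genuinely expired server certificate would still fail, in which case requests.get is the practical fallback.

import ssl
import certifi
from urllib.request import urlopen

# Build an SSL context from certifi's current CA bundle so urlopen can
# validate the server certificate (assumes certifi is installed).
context = ssl.create_default_context(cafile=certifi.where())
html = urlopen('https://www.pythonscraping.com/pages/page1.html', context=context)
print(html.read())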
|
Rey Collazo |
Mar 20, 2023 |
PDF |
Chapter 1, Connecting, pg 5 (PDF) |
I submitted unconfirmed errata on 03/19/2023 but forgot to mention the Python version (3.8.6), on Windows 10, using both IDLE and VS Code (version 1.75).
Sorry. Getting on in age (69) - my bad 8-)
Thanks, Rey
|
Rey Collazo |
Mar 20, 2023 |
PDF |
Chapter 1, Connecting, pg 5 (PDF) |
Will continue reading and gaining experience. Thank you, Rey
# from Web Scraping with Python, 2nd ed., O'Reilly
# The book code does not work as printed...
# Had to spend time researching the urlopen error "certificate has expired (_ssl.c:1124)",
# which was not the issue.
# Tried pip-installing certifi, and finally found a Stack Overflow post detailing the
# difference between requests.get and urllib.request.urlopen
# that corrected the error and provided other clues.
# From Chapter 1:
# from urllib.request import urlopen
# html = urlopen('*')
# The line above results in urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED]
# certificate expired.
# Reaching the print below after switching to requests.get results in
# AttributeError: 'Response' object has no attribute 'read'
# print(html.read())
# The following works...
import requests

url = '*'
html = requests.get(url)  # works
# Also works, but is lengthy...
# html = requests.get('*')
# print(html)  # returns <Response [200]> but no content
print('Status code: ', html.status_code)  # returns 200
print('Content:\n ', html.text)  # provides the content
|
Anonymous |
Mar 19, 2023 |
Printed, PDF, Mobi |
Page xi, About This Book, last sentence of the 2nd paragraph |
> If you are a more advanced reader, feel free to skim these parts!
skim -> skip?
|
niki |
Feb 22, 2021 |
PDF |
Page 35, 2nd paragraph |
When I ran the code block in the 2nd paragraph, I got the following error message:
AttributeError: 'NoneType' object has no attribute 'find_all'
|
Anonymous |
Aug 31, 2020 |
PDF |
Page 34, 2nd paragraph |
When I ran the code in the 2nd paragraph on page 34, I got the following error message:
AttributeError: 'NoneType' object has no attribute 'find_all'
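
For context, this error usually means the preceding find() call returned None (no matching tag) and find_all() was then called on None. A defensive sketch, assuming the bodyContent lookup used in this chapter's Wikipedia examples:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html.read(), 'html.parser')

# find() returns None when nothing matches; guard before chaining find_all()
body = bs.find('div', {'id': 'bodyContent'})
if body is None:
    print('div#bodyContent not found; the page layout may have changed')
else:
    for link in body.find_all('a'):
        print(link.get('href'))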
|
Barrick Chang |
Aug 31, 2020 |
Printed |
Page 9, last code example |
You need to import HTTPError like this:
from urllib.error import HTTPError
in order to use the HTTPError handler.
This import is only shown in the code sample two pages later, and is never explicitly mentioned.
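
A minimal version of the handler with the missing import in place, assuming the page1.html URL used in the chapter:

from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except HTTPError as e:
    print(e)  # e.g. "HTTP Error 404: Not Found"
else:
    print(html.read())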
|
STAVROS MACRAKIS |
May 26, 2020 |
ePub |
Page 4, 9 |
bsObj = BeautifulSoup( html)
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 423-424). O'Reilly Media. Kindle Edition.
The call to BeautifulSoup on Windows, with Python 3 from Anaconda3, can produce an error if the web page contains a non-ASCII character. It can be fixed deep down in the Python code, but it would be better to warn the reader that, at least in this setting, you can get a character-encoding error.
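
One way to sidestep this is to decode the response bytes explicitly rather than letting the decoding depend on a platform default; a sketch, assuming the page is UTF-8:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
# Decode the raw bytes explicitly; on Windows the default codec may not be
# UTF-8, which is one place a character error can come from.
bs = BeautifulSoup(html.read().decode('utf-8'), 'html.parser')
print(bs.h1)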
|
Clifford Ireland |
Jul 02, 2019 |
Printed |
Page 76, code at top of page |
In the code for storing data into a CSV file on pages 75-76, the line writer.writerow(csvRow) is indented one level too far. As printed in the book, that line writes to the CSV file every time the nested for loop runs. It should be unindented so that it writes to the CSV file only once csvRow holds the entire row of information.
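
A sketch of the corrected structure (the surrounding loop is reconstructed from that section, so the variable names are approximate): writer.writerow belongs at the level of the outer loop.

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://en.wikipedia.org/wiki/Comparison_of_text_editors')
bs = BeautifulSoup(html.read(), 'html.parser')
rows = bs.find('table', {'class': 'wikitable'}).find_all('tr')

with open('editors.csv', 'wt', newline='', encoding='utf-8') as csvFile:
    writer = csv.writer(csvFile)
    for row in rows:
        csvRow = []
        for cell in row.find_all(['td', 'th']):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)  # one write per table row, not one per cell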
|
AV |
Apr 25, 2019 |
PDF |
Page 93, 4th paragraph |
timestamp column
should be:
created column
|
Ron ter Borg |
Dec 27, 2018 |
PDF |
Page 73, last paragraph before code |
Using two separate Rule and LinkExtractor classes with a single parsing function...
This is not correct. Rule is a function and not a class.
|
Ron ter Borg |
Dec 23, 2018 |
PDF |
Page 64-65, end of 64, beginning of 65 |
The classes Website and Webpage (and hence the derived subclasses) seem to have been used without much thought.
I think all of these classes should be Webpage, with the subclasses Product and Article extending Webpage.
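
A sketch of the hierarchy this suggestion describes (the attribute names are illustrative, not the book's):

class Webpage:
    """Common base class for any scraped page"""
    def __init__(self, url, title):
        self.url = url
        self.title = title

class Product(Webpage):
    """A page that sells a product"""
    def __init__(self, url, title, price):
        super().__init__(url, title)
        self.price = price

class Article(Webpage):
    """A page containing an article"""
    def __init__(self, url, title, body):
        super().__init__(url, title)
        self.body = body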
|
Ron ter Borg |
Dec 22, 2018 |
PDF |
Page 64, 6th paragraph |
If the pages are all similar (they all have basically the same types of content), you may want to add a pageType attribute to your existing web-page object:
class Website:
"""Common base class for all articles/pages"""
-------------------------------
Should the class not be named class Webpage?
|
Ron ter Borg |
Dec 22, 2018 |
Printed |
Page 30, the section with code (the regex) |
I think that for a regex newbie like me, it would be nice if the regex were consistent. If the forward slashes don't need to be escaped, then why are they escaped?
I am just really confused about the forward slashes.
Thanks for your help.
What I posted in a regex course:
I was reading a book about python web scraping and then referenced the quick guide to regex and it seems to me that if I want to find the following pattern:
../img/gifts/img1.jpg, ../img/gifts/img2.jpg etc.
The expression should really be:
'\.\.\/img\/gifts\/img.*\.jpg' right?
Wondering if I am missing something.
The response I got:
sure Ray. Looks good. You can further tighten the constraints (instead of wildcard *) by explicitly looking for digits. You don't need to escape forward slashes.
\.\./img/gifts/img\d{1,}\.jpg
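
Both forms compile and match in Python, since re treats an escaped forward slash the same as a bare one; a quick check:

import re

paths = ['../img/gifts/img1.jpg', '../img/gifts/img2.jpg']
escaped = re.compile(r'\.\.\/img\/gifts\/img.*\.jpg')  # escaped slashes (harmless, just noisy)
plain = re.compile(r'\.\./img/gifts/img\d+\.jpg')      # unescaped slashes, digits only
for p in paths:
    print(escaped.match(p) is not None, plain.match(p) is not None)  # True True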
|
Anonymous |
May 09, 2018 |
Printed |
Page 35, first paragraph through code midway down |
The issue about the colon vs. semicolon in other errata is moot: colons do appear in valid links in the Kevin Bacon wikipedia page. Examples include:
/wiki/Tremors_5:_Bloodline
/wiki/X-Men:_First_Class
Therefore the regular expression seeking to exclude all non-content links and include only content links excludes at least two content links.
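
A quick check confirming that the chapter's exclusion regex rejects these content links along with the intended non-content ones:

import re

pattern = re.compile('^(/wiki/)((?!:).)*$')
print(pattern.match('/wiki/Kevin_Bacon') is not None)             # True
print(pattern.match('/wiki/X-Men:_First_Class') is not None)      # False: a content link is excluded
print(pattern.match('/wiki/Category:Living_people') is not None)  # False: the intended exclusion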
|
Anonymous |
May 04, 2018 |
Printed, PDF |
Page 33, 2nd paragraph |
Second paragraph states:
"The URLs do not contain semicolons"
Line six of following example code:
for link in bsObj.find("div", {"id":"bodyContent"}).findAll("a",
href=re.compile("^(/wiki/)((?!:).)*$")):
should be:
for link in bsObj.find("div", {"id":"bodyContent"}).findAll("a",
href=re.compile("^(/wiki/)((?!;).)*$")):
|
JR |
Oct 04, 2017 |
PDF |
Page 33, 2nd bullet point |
quote: " The URLs do not contain semicolons"
should be: The URLs do not contain colons
|
Anonymous |
Jul 20, 2017 |
PDF |
Page 28, under the section "Lambda Expressions" |
quote:
"BeautifulSoup allows us to pass certain types of functions as parameters into the findAll function. The only restriction is that these functions must take a tag object as an argument and return a boolean.
Every tag object that BeautifulSoup encounters is evaluated in this function, and tags that evaluate to “true” are returned while the rest are discarded.
soup.findAll(lambda tag: len(tag.attrs) == 2)"
---
It says the functions passed into findAll() must return a boolean, but len() in the example returns an int, not a boolean. I think you meant that the condition evaluates to true, rather than that the inner function returns a boolean, but the phrasing could be clearer.
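
For what it's worth, the lambda itself does return a bool, because the comparison is the lambda's body; len() only feeds that comparison:

from bs4 import BeautifulSoup

html = '<div id="a" class="b">two attrs</div><div id="c">one attr</div>'
soup = BeautifulSoup(html, 'html.parser')

# len(tag.attrs) returns an int, but len(tag.attrs) == 2 evaluates to a bool,
# so the lambda satisfies findAll's boolean requirement.
print(soup.findAll(lambda tag: len(tag.attrs) == 2))  # matches only the first div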
|
Anonymous |
Jul 13, 2017 |
PDF |
Page 32, code example on the page |
When the book instructs you to build a simple scraper to find Kevin Bacon's film history, it does not take into account that Wikipedia has blocked this type of crawler.
The example on page 32 is impossible to follow along with because Wikipedia now requires SSL access from the crawler.
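
A workaround sketch: fetch the page over https, optionally with a browser-style User-Agent (whether the header is required is an assumption; the https scheme is the submitter's point):

from urllib.request import urlopen, Request

# https rather than http; a browser-like User-Agent in case the default
# Python-urllib agent is rejected (an assumption, not confirmed here).
req = Request('https://en.wikipedia.org/wiki/Kevin_Bacon',
              headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req)
print(html.getcode())  # 200 if the request was accepted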
|
Dockmann |
Jun 01, 2017 |
|
Chapter 2, Table 2-1, the meaning section for $ |
Simply missing the letter 'h' in the word 'thought'. Text: "This can be thougt of as analogous to the ^ symbol."
|
Devin |
Mar 12, 2017 |
PDF |
Page 10, 2nd paragraph below the code |
The book says, "If the server is not found at all (if, say, http://www.pythonscraping.com was down, or the URL was mistyped), urlopen returns a None object." But urlopen never actually returns a None object.
In "The Python Standard Library" documentation, the introduction of urlopen says, "Note that None may be returned if no handler handles the request (though the default installed global OpenerDirector uses UnknownHandler to ensure this never happens)." (https://docs.python.org/3/library/urllib.request.html#module-urllib.request)
So instead of returning a None object, urlopen raises a URLError when the server is not found, as the documentation says: "Raises URLError on protocol errors."
In my test, when the URL was mistyped intentionally, it raised "URLError: <urlopen error [Errno 11001] getaddrinfo failed>".
This chapter introduces only HTTPError (a subclass of URLError) and omits URLError, which seems incomplete.
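
A sketch of the fuller handler this suggests, catching both exceptions (the nonexistent URL is illustrative):

from urllib.request import urlopen
from urllib.error import HTTPError, URLError

try:
    html = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e:          # must come first: HTTPError subclasses URLError
    print('The server returned an HTTP error:', e)
except URLError as e:
    print('The server could not be found:', e.reason)
else:
    print(html.read())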
|
Anonymous |
Jul 17, 2016 |