Errata for Web Scraping with Python

The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction is displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.

Version Location Description Submitted By Date submitted Date corrected
Printed, PDF, ePub
last code block

Running the table to csv code (to turn a wikipedia table into a csv file) only captures the
headers. The cells aren't filled with anything.

Note from the Author or Editor:
Formatting error causes inner "for" loop to be outdented, causing the logic in the code to break. The code on Github is correct: https://github.com/REMitchell/python-scraping/blob/master/chapter5/3-scrapeCsv.py
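
For reference, a minimal sketch of the intended nesting, simplified from the linked GitHub file (the URL and output filename here are illustrative):

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://en.wikipedia.org/wiki/Comparison_of_text_editors")
bsObj = BeautifulSoup(html)
# Grab the first wikitable on the page
table = bsObj.findAll("table", {"class": "wikitable"})[0]
rows = table.findAll("tr")

csvFile = open("editors.csv", "wt", newline="", encoding="utf-8")
writer = csv.writer(csvFile)
try:
    for row in rows:
        csvRow = []
        # This loop must stay nested inside the row loop above; in the misprint
        # it was outdented, which is why only the headers were captured
        for cell in row.findAll(["td", "th"]):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)
finally:
    csvFile.close()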

Anonymous  Jul 26, 2015  Oct 30, 2015
ePub

Traceback (most recent call last):
File "/home/dave/python/scrape_add_to_db.py", line 28, in <module>
links = getLinks("/wiki/Kevin_Bacon")
File "/home/dave/python/scrape_add_to_db.py", line 22, in getLinks
title = bsObj.find("h1").find("span").get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'

I'm pretty sure that the error "None" means some problem downloading the url, but I know that I got pymysql working and changed my character sets. I thought that kindle might have mangled your nice code again so I went to github and copied and pasted the code, still same error. This is chapter 5 about 34% into the book (no page number on Kindle).

Note from the Author or Editor:
Unfortunately, Wikipedia has removed span tags from its titles, breaking some of the code in the book. This can be fixed by removing "find("span")" from the code, and just writing:
title = bsObj.find("h1").get_text()

This will be fixed in ebook editions and updated for future print editions.
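
For context, a minimal sketch of the corrected lookup (the getTitle helper and the None guard are illustrative additions, not the book's exact code):

from urllib.request import urlopen
from bs4 import BeautifulSoup

def getTitle(pageUrl):
    bsObj = BeautifulSoup(urlopen("http://en.wikipedia.org" + pageUrl))
    # Wikipedia no longer wraps the title in a <span>, so read the <h1> directly
    h1 = bsObj.find("h1")
    return h1.get_text() if h1 is not None else None

print(getTitle("/wiki/Kevin_Bacon"))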

Anonymous  Jul 27, 2015  Oct 30, 2015
ePub, Mobi, Other Digital Version
Chapter 8 Reading and Writing Natural Languages; Kindle Locations 3344-3345;

missing the 's' in the word bigramsDist in the line of code: bigramDist[("Sir", "Robin")]

Note from the Author or Editor:
Good catch! Have fixed for upcoming prints/ebook releases.
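
For reference, a small runnable sketch of the corrected line (the sample sentence is made up, not the book's corpus):

from nltk import word_tokenize, bigrams, FreqDist

# requires the NLTK "punkt" tokenizer data: nltk.download("punkt")
text = word_tokenize("Sir Robin ran away away away brave Sir Robin")
bigramsDist = FreqDist(bigrams(text))
# The distribution is named bigramsDist, so the lookup must use the same name:
print(bigramsDist[("Sir", "Robin")])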

golfpsy101  Aug 11, 2015  Oct 30, 2015
Other Digital Version
Chapter 8 Reading and Writing Natural Languages; Kindle Locations 3401-3406

The text coloring is not consistent for the string in the line of code:

text = word_tokenize("Strange women lying in ponds distributing swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony.")

Note from the Author or Editor:
Will be fixed in the ebook and upcoming printings of the book.

golfpsy101  Aug 11, 2015  Oct 30, 2015
ePub
Page 2
Chapter 2

In chapter 2, "Advanced HTML Parsing", I've found the following two errors:
(1) in the section titled "A Caveat to the keyword Argument", there is a sentence that begins with 'Alternatively, you can enclose class in quotes'. The sample code that follows 'bsObj.findall("", {"class":"green"}' is missing the right parenthesis.

(2) Once again in chapter 2, "Advanced HTML Parsing", in the section titled "Other BeautifulSoup Objects" there is a sentence that is indented under "Tag objects" that ends in a colon (':'). The colon, traditionally and grammatically, signals that additional information follows, but none does. Is this a grammar typo or is the text that follows the colon actually missing?

Please accept my apology for not providing page numbers but my ePub version of your book does not contain page numbering on my Kindle Fire. I now have a valid reason why I should not buy eBooks. From here on, I'll stick to printed technical books: they have always served me well. Not to lay the blame at your feet, but I'm going to buy your print version. I'm working on a project and I don't need the distractions.

Note from the Author or Editor:
On page 17, the line should read:
bsObj.findAll("", {"class":"green"})

On page 18, the line:
bsObj.div.h1
Should be moved from its original position and placed under the description of "Tag objects" where it says "Retrieved in lists or individually by calling find and findAll on a BeautifulSoup object, or drilling down, as in:" What follows this sentence should be the example "bsObj.div.h1"
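
To illustrate the corrected call, a brief sketch of the two equivalent ways to filter on a CSS class (the HTML sample is made up):

from bs4 import BeautifulSoup

html = '<span class="green">the Green Knight</span><span class="red">a red herring</span>'
bsObj = BeautifulSoup(html, "html.parser")

# Keyword-argument form: class_ is used because "class" is a reserved word in Python
print(bsObj.findAll(class_="green"))

# Equivalent attribute-dictionary form, with "class" in quotes and the parenthesis closed
print(bsObj.findAll("", {"class": "green"}))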

Anonymous  Jul 02, 2015  Jul 22, 2015
Printed, PDF, ePub
Page 16
3rd code example showing how to return both red and green span tags

.findAll("span", {"class": "green", "class": "red"})

An attempt to create a Python dict with repeated keys will preserve just the last one entered in the dict.

The correct version would be:

.findAll("span", {"class": {"green", "red"}})

Note that we're now passing a collection (set) as the value for the "class" key in the attributes dict.

Note from the Author or Editor:
The line on page 16, in Chapter 2, should read:
.findAll("span", {"class":{"green", "red"}})

Anonymous  Jul 04, 2015  Jul 22, 2015
Printed
Page 16
line 11 from bottom, 2nd paragraph from bottom, 3rd sentence

"If it is false,"should be read as "If it is False,"

Note from the Author or Editor:
This will be fixed in upcoming prints and editions

Toshi Kurokawa  Dec 17, 2015  Oct 30, 2015
Printed
Page 16
last line of footnote

the section BeautifulSoup and regular expressions.

should be read as
the section "Regular Expressions and BeautifulSoup."

Toshi Kurokawa  Dec 17, 2015  Oct 30, 2015
PDF
Page 16
line 12 from bottom

“If recursion is set to True” should be read as “If recursive is set to True”

Note from the Author or Editor:
Fixed in upcoming prints
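
For context, a small sketch of the recursive argument in action (the HTML sample is made up):

from bs4 import BeautifulSoup

html = "<div><p>top level</p><div><p>nested</p></div></div>"
bsObj = BeautifulSoup(html, "html.parser")
outer = bsObj.div

# recursive=True (the default) descends into children of children:
print(len(outer.findAll("p", recursive=True)))   # 2

# recursive=False only examines the tag's direct children:
print(len(outer.findAll("p", recursive=False)))  # 1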

Toshi Kurokawa  Dec 29, 2015  Oct 30, 2015
PDF, ePub
Page 18
8th paragraph

The paragraph states:
"Retrieved in lists or individually by calling find and findAll on a BeautifulSoup object, or drilling down, as in:"
It ends with a colon, but it is followed by a new paragraph.
Suggestion:
It looks like the 4th paragraph (a line with only "bsObj.div.h1") should be moved there instead, and not simply removed, as suggested in the Note from the Author or Editor.

Anonymous  Jul 17, 2015  Jul 22, 2015
Printed
Page 20
last line of 3rd paragraph

The 'body' of "body tag" should be in bold font.

Toshi Kurokawa  Dec 17, 2015  Mar 18, 2016
PDF
Page 22
line 12 from bottom, in the tree

"- s<td> (2)"should be read as "- <td> (2)"

Toshi Kurokawa  Dec 17, 2015  Oct 30, 2015
PDF
Page 23
lines 17 and 22 from the top

The linear rule number 4 at line 17 says:
"4. Optionally, write the letter "d" at the end.", which says nothing about a blank at the end;
however,
the line 22 regex says
aa*bbbbb(cc)*(d | ), where a blank comes at the end.
This should be read as the following to be consistent with the rule:
aa*bbbbb(cc)*(d|).
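
To see the difference, an illustrative check in Python (not an example from the book):

import re

pattern = re.compile("^aa*bbbbb(cc)*(d|)$")
print(bool(pattern.match("aabbbbbccd")))   # True: the optional trailing "d" is present
print(bool(pattern.match("aabbbbbcc")))    # True: the empty alternative matches
# With the printed "(d | )", the alternatives would be "d " and " " (with literal
# spaces), so neither of the strings above would match.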

Note from the Author or Editor:
Changed text to different, more useful, example

Toshi Kurokawa  Dec 17, 2015  Mar 18, 2016
PDF, ePub
Page 27
9th paragraph

The text reads:
"
from urllib.request
import urlopenfrom bs4
import BeautifulSoupimport re
"
It should be:
"

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
"

lbrancolini  Jul 17, 2015  Jul 22, 2015
PDF
Page 41-42
code snippet, followExternalOnly, 3rd Printing

The code has serious bugs in handling internal links. Here is the debugged code:

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import re
import random

#Retrieves a list of all Internal links found on a page
def getInternalLinks(bsObj, includeUrl):
    internalLinks = []
    #Finds all links that begin with a "/"
    for link in bsObj.findAll("a",
            href=re.compile("^(\/|.*(http:\/\/"+includeUrl+")).*")):
        if link.attrs['href'] is not None and len(link.attrs['href']) != 0:
            if link.attrs['href'] not in internalLinks:
                internalLinks.append(link.attrs['href'])
    return internalLinks

#Retrieves a list of all external links found on a page
def getExternalLinks(bsObj, url):
    excludeUrl = getDomain(url)
    externalLinks = []
    #Finds all links that start with "http" or "www" that do
    #not contain the current URL
    for link in bsObj.findAll("a",
            href=re.compile("^(http)((?!"+excludeUrl+").)*$")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks

def getDomain(address):
    return urlparse(address).netloc

def followExternalOnly(bsObj, url):
    externalLinks = getExternalLinks(bsObj, url)
    if len(externalLinks) == 0:
        print("Only internal links here. Try again.")
        internalLinks = getInternalLinks(bsObj, getDomain(url))
        if len(internalLinks) == 0:
            return
        if len(internalLinks) == 1:
            randInternalLink = internalLinks[0]
        else:
            randInternalLink = internalLinks[random.randint(0, len(internalLinks)-1)]
        if randInternalLink[0:4] != 'http':
            randInternalLink = 'http://'+getDomain(url)+randInternalLink
        if randInternalLink == url and len(internalLinks) == 1:
            return
        bsObjnext = BeautifulSoup(urlopen(randInternalLink), "html.parser")
        #Try again
        followExternalOnly(bsObjnext, randInternalLink)
    else:
        randomExternal = externalLinks[random.randint(0, len(externalLinks)-1)]
        try:
            nextBsObj = BeautifulSoup(urlopen(randomExternal), "html.parser")
            print(randomExternal)
            #Next page!
            followExternalOnly(nextBsObj, randomExternal)
        except HTTPError:
            #Try again
            print("Encountered error at "+randomExternal+"! Trying again")
            followExternalOnly(bsObj, url)

url = "http://oreilly.com"
bsObj = BeautifulSoup(urlopen(url), "html.parser")
#Recursively follow external links
followExternalOnly(bsObj, url)

Note from the Author or Editor:
This code has been updated on Github and will be fixed in upcoming prints and editions of the book

Toshi Kurokawa  Jan 03, 2016  Mar 18, 2016
PDF
Page 42
Inside the 'getRandomExternalLink' function.

Inside the getRandomExternalLink function in the if/else statement, the 'if' statement is set to return 'getNextExternalLink' if the length of externalLinks is equal to zero.

The 'getNextExternalLink' was never defined.

Note from the Author or Editor:
Updated code can be found in the github repository at: https://github.com/REMitchell/python-scraping/blob/master/chapter3/4-getExternalLinks.py

Anonymous  Sep 14, 2015  Oct 30, 2015
PDF
Page 42
Line 5 from top, the comment

#Finds all links that start with "http" or "www" that do
Should be read as
#Finds all links that start with "http" that do
To reflect the revised code at line 8 from the top.

Note from the Author or Editor:
Changed the code to reflect this comment

Toshi Kurokawa  Jan 01, 2016  Mar 18, 2016
PDF
Page 42
the bottom example lines

Random external link is: http://igniteshow.com/
Random external link is: http://feeds.feedburner.com/oreilly/news
Random external link is: http://hire.jobvite.com/CompanyJobs/Careers.aspx?c=q319
Random external link is: http://makerfaire.com/
Should be read as
http://igniteshow.com/
http://feeds.feedburner.com/oreilly/news
http://hire.jobvite.com/CompanyJobs/Careers.aspx?c=q319
http://makerfaire.com/
Reflecting the revised print call, line 10 from the bottom of the code snippet.

Note from the Author or Editor:
Updated code to reflect printout

Toshi Kurokawa  Jan 01, 2016  Mar 18, 2016
PDF
Page 45
The bottom schema

The directory structure is different from the one shown:
• scrapy.cfg
— wikiSpider
— __init.py__
— items.py

This should be the following:
— scrapy.cfg
— wikiSpider
    — __init.py__
    — items.py

Toshi Kurokawa  Dec 17, 2015  Mar 18, 2016
PDF
Page 46
1st sentence

The 1st sentence:
In order to create a crawler, we will add a new file to wikiSpider/wikiSpider/spiders/
articleSpider.py called items.py.

Should be read as:
In order to create a crawler, we will add a new file, articleSpider.py, to wikiSpider/wikiSpider/spiders/.

Toshi Kurokawa  Dec 17, 2015  Mar 18, 2016
PDF
Page 46
Bottom paragraph, 3rd line and 2nd line

The two words “WikiSpider” should be read as “wikiSpider”.

Toshi Kurokawa  Dec 17, 2015  Oct 30, 2015
PDF
Page 48
Sidebar ‘Logging with Scrapy’, last sentence

The last sentence reads:
This will create a new logfile, if one does not exist, in your current directory and output all logs and print statements to it.

this should be read as

This will create a new logfile, if one does not exist, in your current directory and output all logs to it.

Toshi Kurokawa  Dec 17, 2015  Mar 18, 2016
PDF
Page 58
1st code after side-bar of Twitter Credential Permissions

from twitter import Twitter

should be read as
from twitter import Twitter, OAuth
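
For context, a sketch of why OAuth is needed in the import; the credentials and query below are placeholders:

from twitter import Twitter, OAuth

# Placeholder credentials; substitute your own tokens
t = Twitter(auth=OAuth("accessToken", "accessSecret", "consumerKey", "consumerSecret"))
# OAuth is referenced on the line above, so it must be imported alongside Twitter
pythonTweets = t.search.tweets(q="#python")
print(pythonTweets)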

Toshi Kurokawa  Dec 18, 2015  Oct 30, 2015
PDF
Page 62
2nd paragraph, 1st sentence

Google’s Geocode API,
should be read as
Google’s Geocoding API

Toshi Kurokawa  Dec 18, 2015  Mar 18, 2016
PDF
Page 67
code snippet

insert
import json

this is missing in https://github.com/REMitchell/python-scraping/blob/master/chapter4/6-wikiHistories.py
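
For context, the script decodes a JSON API response, which is why the import is needed; a heavily simplified illustration (the sample string stands in for the real response):

import json

apiResponse = '{"country_name": "United States", "ip": "50.78.253.58"}'
print(json.loads(apiResponse)["country_name"])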

Note from the Author or Editor:
The import statement has been added for future versions of the book

Toshi Kurokawa  Dec 29, 2015  Mar 18, 2016
Printed
Page 73
getAbsoluteURL()

#second elif:
url = source[4:]
url = "http://"+source

#should be:
url = "http://"+source[4:]

Lem Dulfo  Sep 13, 2015  Oct 30, 2015
PDF
Page 73
last for loop of code snippet

The last part of code snippet:
bsObj = BeautifulSoup(html)
downloadList = bsObj.findAll(src=True)
for download in downloadList:
    fileUrl = getAbsoluteURL(baseUrl, download["src"])
    if fileUrl is not None:
        print(fileUrl)

urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory))

should be

for download in downloadList:
    fileUrl = getAbsoluteURL(baseUrl, download["src"])
    if fileUrl is not None:
        print(fileUrl)
        urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl,
            downloadDirectory))

Note from the Author or Editor:
This was caused by an indentation error. It has been fixed in Github and will be fixed for future editions and prints of the book.

Toshi Kurokawa  Jan 05, 2016  Mar 18, 2016
Printed, PDF, ePub
Page 84
block of code

The line of code:
import re
is missing: a Regular Expression is used at the end of the getLinks function:
return bsObj.find("div",{"id":"bodyContent"}).findAll("a",
href=re.compile("^(/wiki/)((?!:).)*$"))
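
A self-contained sketch of the function with the missing import added (slightly simplified from the page 84 code):

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html)
    # The href filter below is the reason "import re" is required
    return bsObj.find("div", {"id": "bodyContent"}).findAll("a",
        href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
print(len(links))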

Anonymous  Jul 21, 2015  Oct 30, 2015
PDF
Page 88
import statements in code snippet

from urllib.request import urlopen
appears twice – redundant.

Toshi Kurokawa  Jan 01, 2016  Mar 18, 2016
PDF
Page 94
last sentence before the section, Text

In this chapter, I’ll cover several commonly encountered types of files: text, PDFs, PNGs, and GIFs.

However, PNG and GIF are not covered. It should be read as:
In this chapter, I’ll cover several commonly encountered types of files: text, PDFs, and .docx.

Toshi Kurokawa  Dec 18, 2015  Mar 18, 2016
PDF
Page 98
4th paragraph

Whereas the European Computer Manufacturers Association’s website has this tag

However, it is now officially ECMA International, so it should be read as:
Whereas the ECMA International’s website has this tag

Toshi Kurokawa  Dec 18, 2015  Mar 18, 2016
PDF
Page 104
output of the <w:t>tag, last ouput example

This is a Word document, full of content that you want very much. Unfortunately,
it’s difficult to access because I’m putting it on my website as a . docx
file, rather than just publishing it as HTML

should be read as
This is a Word document, full of content that you want very much. Unfortunately, it’s difficult to access because I’m putting it on my website as a .
docx
file, rather than just publishing it as HTML

Toshi Kurokawa  Dec 18, 2015  Mar 18, 2016
Printed
Page 113
8th paragraph, final code on page

In the Data Normalization section of chapter 7, there is a reference to recording the frequency of the 2-grams; then at the bottom of the page we are given a code snippet that introduces OrderedDict and uses the sorted function. In the sorted function the code contains ngrams.items(); however, the ngrams method returns a list, and lists do not have an items() method, so the program generates an error.

In the next chapter, it looks like the code (at least on GitHub) has the ngrams function return a dictionary instead which allows the code in chapter 7 to work.

Note from the Author or Editor:
I mentioned the code that would accomplish this in passing, but did not actually include it. It will be included in future printings of the book, and in the ebook.
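
One possible shape of that missing step, sketched here; the helper name countNgrams and the sample data are made up:

from collections import OrderedDict

def countNgrams(ngramList):
    # Turn the list returned by ngrams() into a frequency dict,
    # which does have the items() method that sorted() expects
    counts = {}
    for gram in ngramList:
        key = " ".join(gram)
        counts[key] = counts.get(key, 0) + 1
    return counts

sample = [["Python", "Software"], ["Software", "Foundation"], ["Python", "Software"]]
ngrams = OrderedDict(sorted(countNgrams(sample).items(),
                            key=lambda t: t[1], reverse=True))
print(ngrams)
# OrderedDict([('Python Software', 2), ('Software Foundation', 1)])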

Micheal Beatty  Aug 16, 2015  Oct 30, 2015
PDF
Page 113
line 4 output

("['Software', 'Foundation']", 40), ("['Python', 'Software']", 38),....

should be read as
OrderedDict([("['Software', 'Foundation']", 40), ("['Python', 'Software']", 38),....

Note from the Author or Editor:
Updated to: "OrderedDict([('of the', 38), ('Software Foundation', 37), ..."

Toshi Kurokawa  Dec 18, 2015  Mar 18, 2016
PDF
Page 113
line 12, output of ngrams

The current output is inconsistent with the code snippet.
("['Software', 'Foundation']", 40), ("['Python', 'Software']", 38), ("['of', 'th
e']", 35), ("['Foundation', 'Retrieved']", 34), ("['of', 'Python']", 28), ("['in
', 'the']", 21), ("['van', 'Rossum']", 18)

First, the value of ngrams is an OrderedDict.
Second, getNgrams generates a string for each 2-gram instead of a list of 2 strings.

The actual output looks like the following:
OrderedDict([('Software Foundation', 37), ('of the', 37), ('Python Software', 37), ('Foundation Retrieved', 32), ('of Python', 32), ('in the', 22), ('such as', 20), ('van Rossum', 19)...

Note from the Author or Editor:
Updated the output of the script to reflect the use of the OrderedDict

Toshi Kurokawa  Jan 02, 2016  Mar 18, 2016
PDF
Page 115
line 6 from the bottom

me data that contains four or more comma-seperated programming languages

Should be read as
me data that contains three or more comma-seperated programming languages

Toshi Kurokawa  Dec 18, 2015  Mar 18, 2016
PDF
Page 118
the last sentence

The last sentence refers:
guide to the language can be found on OpenRefine’s GitHub page

This pointer refers to https://github.com/sixohsix/twitter/tree/master, which is not the precise page for the OpenRefine guide documents.

This should be
https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Users

Toshi Kurokawa  Dec 18, 2015  Mar 18, 2016
PDF
Page 122
output bullets at the bottom

• The Constitution of the United States is the instrument containing this grant of
power to the several departments composing the government.
Should be read as
• The Constitution of the United States is the instrument containing this grant of
power to the several departments composing the Government.

The general government has seized upon none of the reserved rights of the states.
Should be read as
The General Government has seized upon none of the reserved rights of the States.

Toshi Kurokawa  Dec 18, 2015  Mar 18, 2016
PDF
Page 123
bulleted output at the top

The presses in the necessary employment of the government should never be used
to clear the guilty or to varnish crime.

Should be read as
The presses in the necessary employment of the Government should never be used
to “clear the guilty or to varnish crime.”

Toshi Kurokawa  Dec 18, 2015  Mar 18, 2016
PDF
Page 123
The 2nd sentence, reference to the "That can be my next tweet!" app

The link embedded in the PDF for "That can be my next tweet!" is wrong; it should be
http://yes.thatcan.be/my/next/tweet/

Note from the Author or Editor:
The page has changed since the book was written. Updated for future editions

Toshi Kurokawa  Jan 01, 2016  Mar 18, 2016
PDF
Page 139
line 10 from bottom, the 1st bullet

name is email_address)

should be read as
name is email_addr)

Toshi Kurokawa  Dec 18, 2015  Oct 30, 2015
PDF
Page 139
line 4-5 from bottom in the code snippet

The part of code snippet
r = requests.post("http://post.oreilly.com/client/o/oreilly/forms/
quicksignup.cgi", data=params)

causes an EOL error because of the string break. It should be like the following:
r = requests.post(
"http://post.oreilly.com/client/o/oreilly/forms/quicksignup.cgi",
data=params)

Note from the Author or Editor:
Because of the limitations of printing, there are many instances throughout the book where code needs to be cut off and continued on the next line. Please either correct these as you copy them from the book, or refer to the code repository on Github.
In this case, I will use the suggested version, because it corrects an issue with the syntax highlighting caused with this particular line break.

Toshi Kurokawa  Jan 06, 2016  Mar 18, 2016
Printed
Page 141
Code sample on bottom of page

The code says `name="image"`, but following page suggests (and code on actual site is) `name="uploadFile"`.
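
For reference, a sketch of the corrected field name using the Requests library (the URL and filename are illustrative, not necessarily the book's exact values):

import requests

# The form field on the target page is named "uploadFile", not "image"
files = {"uploadFile": open("python-logo.png", "rb")}
r = requests.post("http://pythonscraping.com/pages/processing2.php", files=files)
print(r.text)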

Ian Gow  Jan 02, 2016  Mar 18, 2016
PDF
Page 142
4th paragraph from the top

Once a site authenticates your login credentials a it stores in your browser a cookie,

Should be read as
Once a site authenticates your login credentials, it stores in your browser a cookie,

Note from the Author or Editor:
Changed to "Once a site authenticates your login credentials it stores them in your browser’s cookie"

Toshi Kurokawa  Dec 18, 2015  Mar 18, 2016
PDF
Page 149
footnote

http://blog.jquery.com/2014/01/13/the-stateof-jquery-2014/

should be read as
http://blog.jquery.com/2014/01/13/the-state-of-jquery-2014/

Toshi Kurokawa  Dec 18, 2015  Mar 18, 2016
PDF
Page 149
1st sentence of 2nd paragraph

If you find jQuery is found on a site, you must be careful when scraping it. jQuery is
Should be read as
If you find jQuery on a site, you must be careful when scraping it. jQuery is

Toshi Kurokawa  Dec 29, 2015  Mar 18, 2016
PDF
Page 154
code at the bottom and the line above

page has been fully loaded: from selenium import webdriver.

from selenium.webdriver.common.by import By

should be laid out as

page has been fully loaded:

from selenium import webdriver
from selenium.webdriver.common.by import By
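
For context, a sketch of how these two imports are typically used together for an explicit wait; the URL, element id, and PhantomJS path are illustrative:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# The path to the PhantomJS binary is machine-specific
driver = webdriver.PhantomJS(executable_path="/path/to/phantomjs")
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
try:
    # Block until the element appears, i.e. the page has fully loaded
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "loadedButton")))
    print(driver.find_element_by_id("content").text)
finally:
    driver.close()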

Toshi Kurokawa  Dec 29, 2015  Mar 18, 2016
PDF
Page 162
line 6 from the top

The link for installing Pillow
http://pillow.readthedocs.org/installation.html does not work; instead use
http://pillow.readthedocs.org/en/3.0.x/

Note from the Author or Editor:
The link has changed since publication, and is updated in future versions.

Toshi Kurokawa  Dec 18, 2015  Mar 18, 2016
PDF
Page 169
3rd paragraph, 1st sentence after the colon

Computer Automated Public Turing test to tell Computers and Humans Apart

should be read as
Completely Automated Public Turing test to tell Computers and Humans Apart

Toshi Kurokawa  Dec 18, 2015  Oct 30, 2015
ePub
Page 172
Figure 8.1

The diagram 8.1 about a Markov weather model has one incorrect percentage value and one incorrect arrow direction:

1. The value for Sunny being sunny the next day should be 70% rather than 20%.
2. The arrow for the 15% chance of Rainy being followed by Cloudy should be reversed so that this shows a 15% chance of Cloud being followed by Rain.

Note from the Author or Editor:
The description is correct. The corrected Markov diagram is: http://pythonscraping.com/img/markov_8.1.png

Dane Wright  Jul 17, 2015  Jul 22, 2015
PDF
Page 172
2nd paragraph from the bottom and the code snippet

The paragraph and the 1st code refer to/define main as existing in the https://github.com/REMitchell/tesseract-trainer/blob/master/trainer.py code referred to in the preceding paragraph.

However, there is no main method in this code example; instead __init__ is used. So main should be read as __init__.

Toshi Kurokawa  Dec 29, 2015  Mar 18, 2016
PDF
Page 186
line 3 from the bottom

Use a tool such as Chrome’s Network
inspector to

Should be read as
Use a tool such as Chrome’s Network
panel to

Toshi Kurokawa  Dec 18, 2015  Mar 18, 2016
PDF
Page 221
last paragraph in side column

In the second scenario, the load your Internet connection and home machine can

Should be read as
In the third scenario, the load your Internet connection and home machine can

Toshi Kurokawa  Dec 18, 2015  Mar 18, 2016
PDF
Page 230
1st line

DMCS Safe Harbor

should be read as
DMCA Safe Harbor

Note from the Author or Editor:
Fixed in upcoming prints

Toshi Kurokawa  Dec 28, 2015  Mar 18, 2016