Errata




The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.


Color Key: Serious Technical Mistake, Minor Technical Mistake, Language or formatting error, Typo, Question



Version, Location, Description, Submitted By
Printed Page xix
3rd code snippet

print [v*10 for v in l1 if v1>4]

should read

print [v*10 for v in l1 if v>4]

Anonymous 
Printed Page xix
2nd to last paragraph

move -> movie

Anonymous 
Printed Page xviii
Under List Comprehensions

print [v*10 for v in l1 if v1 > 4]

should read:

print [v*10 for v in l1 if v > 4]

Anonymous 
Printed Page xvii
Third example under the List and dictionary constructors heading

The example list is given as:

string_list = ['a', 'b', 'c', 'd',]

The third example under the List and dictionary heading states the following:

string_list[2] # returns 'b'

Actually, since [2] is the offset, and list offsets start with [0],

string_list[2] should actually return 'c', not 'b'.

Anonymous 
Printed Page xx
line 13

v1 should be v

Anonymous 
Printed Page xvii
9th Paragraph

Under the Heading "Overview of the Chapters", sub-heading "Chapter 2..." The final clause of the paragraph contains the word "move" which should be "movie".

Anonymous 
Printed Page xviii
List comprehensions

[xviii] Python Tips: List comprehensions;
The following statement is NOT valid Python code.

l1=[1,2,3,4,5,6,7,8,9]
print [v*10 for v in l1 if v1>4]

There is no variable called v1 in this scope. The code should probably look more like this:

l1=[1,2,3,4,5,6,7,8,9]
print [v*10 for v in l1 if v>4]

Notice that the if statement was changed.

Ryan 
Safari Books Online NA
Section 6.4.1

c1 = classifier.classifier()
c1.sampletrain()
c1.fprob( 'money', 'bad' ) = 0.5
c1.weightedprob( 'money', 'bad' ) = 0.5

This is the result if we take the examples from his sampletrain(). Yet he lists a hypothetical example where there is only 1 document in the 'bad' category and comes up with these answers:

weight = 1, ap = 0.5, total = 1, raw = 1
c1.weightedprob( 'money', 'bad' ) = 0.75

But his sampletrain() will yield:

weight = 1, ap = 0.5, total = 1, raw = 0.5
c1.weightedprob( 'money', 'bad' ) = 0.5

If this is true, then his recurring example of c1.weightedprob( 'money', 'bad' ) = 0.75 is very misleading.

Anonymous 
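The two sets of figures in the Section 6.4.1 report above can be reproduced with a short calculation. This is a minimal sketch in Python 3, assuming weightedprob blends the assumed probability ap with the raw probability as a weighted average, (weight*ap + total*raw)/(weight + total). That form is consistent with both results quoted in the report, but the helper name weighted_average is hypothetical and is not the book's code.

```python
def weighted_average(weight, ap, total, raw):
    """Blend an assumed probability with an observed one (assumed formula)."""
    return (weight * ap + total * raw) / (weight + total)

# Figures from the hypothetical one-document example in the report.
hypothetical = weighted_average(weight=1, ap=0.5, total=1, raw=1.0)

# Figures the report attributes to sampletrain().
from_sampletrain = weighted_average(weight=1, ap=0.5, total=1, raw=0.5)

print(hypothetical, from_sampletrain)
```

Under that assumption, the only input that differs between the two results is the raw probability.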
Safari Books Online NA
Section 6.6.1

I quote:

"""
* clf = Pr(feature | category) for this category
* freqsum = Sum of Pr(feature | category) for all the categories
* cprob = clf / (clf+nclf)
"""

Where is 'nclf' anywhere here? The code itself says:

cprob = clf / freqsum

I might be mistaken but I am beginning to think that this book has been poorly edited. Some of the code has very bad names for variables, which reduces its readability. The editor probably did not take the time to properly peruse and understand this book, let alone peer review it.

It appears to be a case of, "Oh, what is your book called? 'Programming Artificial Intelligence for the Web'? Oh, dear, I'm afraid that is not going to sell. How about 'Programming Collective Intelligence'? I say, that should sell it to the Twitterverse, chaps."

Anonymous 
Safari Books Online NA
Section 6.7.1

Persisting the classifier using SQLite is a good idea, but its implementation is terribly naive for the simple reason that primary keys for the tables are not specified. This leads to simply atrocious performance for anything but the most trivial of applications. Makes me think that the author wrote pretty much untested Python code for the book.

Anonymous 
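One way to act on the Section 6.7.1 report above is to declare primary keys when the tables are created, so that SQLite builds an index for the lookups. The following is a minimal sketch in Python 3 using the standard sqlite3 module; the table names (fc, cc), the column names, and the helpers incf and fcount are assumptions made for illustration and may not match the book's schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Assumed schema: feature/category counts and per-category counts.
# The primary keys give SQLite an index to use for the lookups below.
con.execute(
    "create table fc(feature text, category text, count integer, "
    "primary key(feature, category))"
)
con.execute("create table cc(category text primary key, count integer)")

def incf(feature, category):
    """Increase the count of a feature/category pair, inserting it if absent."""
    cur = con.execute(
        "update fc set count = count + 1 where feature = ? and category = ?",
        (feature, category),
    )
    if cur.rowcount == 0:
        con.execute("insert into fc values (?, ?, 1)", (feature, category))

def fcount(feature, category):
    """Return the stored count, or 0 if the pair has never been seen."""
    row = con.execute(
        "select count from fc where feature = ? and category = ?",
        (feature, category),
    ).fetchone()
    return 0 if row is None else row[0]

incf("money", "bad")
incf("money", "bad")
print(fcount("money", "bad"))
```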
Safari Books Online PCI_code.zip
addlinkref function

In the addlinkref function, the call to the separatewords function appears as separateWords rather than in all lowercase, as it is defined within the text of the book. Mixing the two 'spellings' causes an error.

Marisano James 
Safari Books Online PCI_code.zip
chapter11, gridgame function, about halfway through

# Board wraps

Should read:

# Board limits

as listed in the code on p. 270 of the printed version of the book (1st ed.), and described in the first paragraph on p. 269.

Marisano James 
Printed Page 6
"National security" section (3rd from last paragraph)

"... and the analysis of this data requires ..."

should read,

"... and the analysis of these data requires ...",

as the word "data" is plural.

Marisano James 
Safari Books Online 8.1
Building a Sample Dataset

Text says:

"The following function generates 200 bottles of wine and
calculates their prices from the model."

However, the wineset1 code generates 300 bottles of wine.

Fix:

"The following function generates 300 bottles of wine and
calculates their prices from the model."

Anonymous 
Safari Books Online 8.1
Building a Sample Dataset

The wineprice function in text does not match online source.

In text:

# Past its peak, goes bad in 5 years
price=price*(5-(age-peak_age))

Online:

# Past its peak, goes bad in 10 years
price=price*(5-(age-peak_age)/2)

Anonymous 
Safari Books Online 8.1
Building a Sample Dataset

The noise calculation in the wineset1 function in the text does not match the online source.

In text:

"It then randomly adds or subtracts 20 percent..."

# Add some noise
price*=(random()*0.4+0.8)

Online:

# Add some noise
price*=(random()*0.2+0.9)

Anonymous 
Safari Books Online 8.3.3
Gaussian Function definition

The default value for sigma is printed as 10.0, but this should be 1.0 in order to match the graph and the printed results.

Wrong:

def gaussian(dist,sigma=10.0):
  return math.e**(-dist**2/(2*sigma**2))

Right:

def gaussian(dist,sigma=1.0):
  return math.e**(-dist**2/(2*sigma**2))

Anonymous 
Safari Books Online 8.4
Cross-Validation

Typo in text and difference between crossvalidate function source in text and online source.

1. Typo:

"Typically, the test set will be a small portion, perhaps 5 percent of [the] all the data..."

2. Crossvalidate source change:

In crossvalidate function source, note that default value of test parameter is 0.5 in the text, but is 0.1 in the online source.

Anonymous 
Safari Books Online 8.5
Heterogeneous Variables

Difference in bottle size between source code and text:

In text:

"Unlike the variables you've used so far, which were between 0 and 100, its range would be up to 1,500."

In source code in text, however, the bottle size ranges up to 3000.

(However, in the source code online, the maximum bottle size is 1500.)

Anonymous 
Safari Books Online 8.6
Optimizing the Scale

Two differences between text and online source:

1. In createcostfunction source in text, trials=10, whereas trials=20 in online source.

2. In weightdomain definition in text, 20 is the maximum weight:

"...let's restrict them to 20 for now."

whereas the maximum weight in online source is 10.

Anonymous 
Safari Books Online 8.6
Optimizing the Scale

The text refers to the geneticoptimize function, but this function is not defined in optimization.py (online).

"You can also try the slower but often more accurate geneticoptimize function and see if it returns similar results..."

However, there is a swarmoptimize function defined in optimization.py.

Anonymous 
Safari Books Online 8.7
Uneven Distributions

There is no createhiddendataset function but there is a wineset3 function:

"The createhiddendataset function creates a dataset that simulates these properties."

Should be:

"The wineset3 function creates a dataset that simulates these properties."

Anonymous 
Safari Books Online 8.7.2
Graphing the Probabilities

The input vector in the example below is wacky:

>>> numpredict.cumulativegraph(data, (1,1), 120)

To match the graph in Figure 8-10, try the following:

>>> numpredict.cumulativegraph(data,[99,20],120)

(Page 186 in printed version?)

Anonymous 
Safari Books Online 8.7.2
Graphing the Probabilities

The input vector and the high value are wacky:

>>> numpredict.probabilitygraph(data,(1,1),6)

To match Figure 8-11, try the following:

>>> numpredict.probabilitygraph(data,[99,20],120)

(Page 186?)

Anonymous 
Safari Books Online 8.9
When to Use k-Nearest Neighbors

Typo: s/observation/observations/

"It's also easy to interpret exactly what's happening because you know it's using the weighted value of other observation[s] to make its predictions."

Anonymous 
Printed Page 10
after 3rd and 4th paragraph

The examples read:

>> from math import sqrt
>> sqrt(pow(5-4,2)+pow(4-1,2))
3.1622776601683795

>> 1/(1+sqrt(pow(5-4,2)+pow(4-1,2)))
0.2402530733520421

but the values are wrong; they should be:

>> from math import sqrt
>> sqrt(pow(1-2,2)+pow(4.5-4.0,2))
1.1180339887498949

>> 1/(1+sqrt(pow(1-2,2)+pow(4.5-4.0,2)))
0.47213595499957939

paulo 
Printed Page 11
2nd code block

reload(recommendations)

is performed before the recommendations module has ever been loaded.

Anonymous 
Printed Page 11
last line of code example

The last line of the sim_distance function:
"return 1/(1+*sqr*t(sum_of_squares))" should be
"return 1/(1+sqrt(sum_of_squares))"

Anonymous 
Printed Page 11
1st example function

The 'confirmed' errata for this mistake was incorrect.

http://oreilly.com/catalog/9780596529321/errata/9780596529321.308 shows

return 1/(1+sum_of_squares)
should be
return 1/(1+*sqr*t(sum_of_squares))

when in fact

return 1/(1+sum_of_squares)
should be
return 1/(1+sqrt(sum_of_squares))

And as a result of that change, the output of the function when run shown in the next code block on the page should have a result of

0.294298055085549
not
0.148148148148148

Anonymous 
Other Digital Version 11
refer to the online error report 'Changes made in the 3/08 printing'

In the error fix code on your page called:
Changes made in the 3/08 printing
at this link:
http://www.oreilly.com/catalog/9780596529321/errata/9780596529321.308

The error report reads like this:

{11} last line of code sample;
return 1/(1+sum_of_squares)
should be
return 1/(1+*sqr*t(sum_of_squares))

and the last line should be this:

return 1/(1+sqrt(sum_of_squares))

Anonymous 
Printed Page 11
Result of execution of code example (sim_distance between 'Lisa Rose' and 'Gene Seymour')

Book gives sim_distance between Lisa and Gene of 0.148148148148. My result was 0.29429805508554946; verified with calculator.

Justin Middleton 
Printed Page 11
last line of code AND python output

last line of code reads:

return 1/(1+*sqr*t(sum_of_squares))

Should read

return 1/(1+sqrt(sum_of_squares))

AND

Result is given below as 0.148148148148

This is equivalent to 1/(1+5.75)

5.75 is sum_of_squares NOT sqrt(sum_of_squares)

Result should be 0.294298055086

Ian Ford 
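The two results quoted in the page 11 reports above differ only in whether the square root is taken. A minimal Python 3 check, using the sum_of_squares value of 5.75 given in the report above (the variable names are illustrative):

```python
from math import sqrt

sum_of_squares = 5.75  # value quoted in the report above

# Without the square root (matches the result printed in the book).
without_sqrt = 1 / (1 + sum_of_squares)

# With the square root (matches the result the submitters obtained).
with_sqrt = 1 / (1 + sqrt(sum_of_squares))

print(round(without_sqrt, 12), round(with_sqrt, 12))
```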
Printed Page 11
In the 4th paragraph from the bottom on page 11

It seems that the Euclidean distance-based similarity score between 'Lisa Rose' and 'Gene Seymour' should be 0.294298

Daqing Chen 
Printed Page 11
numerical result (0.148148...)

This is (supposedly) the result of using the function 'sim_distance' as it appears on this page, but it actually uses the statement

return 1/(1 + sum_of_squares),

rather than the (correct) statement

return 1/(1 + sqrt(sum_of_squares)).

Apparently, the error is the result of *using* the wrong version of the function, while the correct version is *printed* .

Yehiel Milman 
Printed Page 11
Bottom line of python code section describing the function sim_distance

The sim_distance function should return 1/(1+sqrt(sum_of_squares)) rather than 1/(1+sum_of_squares) when inverting the Euclidean distance. The formula for Euclidean distance includes a square root, but the square root is never taken of the sum of all the squares in this segment of code.

Thea 
Printed Page 11
Euclidian Distance Score code snippet

In the code snippet for sim_distance:

return 1/(1+sqrt(sum_of_squares))

Shouldn't this be

return 1/(1+sqrt(sum_of_squares/len(si)))

Also, in the example code zip on the companion website, this is

return 1/(1+sum_of_squares)

which should also be

return 1/(1+sqrt(sum_of_squares/len(si)))


I guess this error has no effect when getting recommendations for a single movie or critic; the resulting similarity values deviate, but keep the same order. However, this error has a serious effect when ordering similarities over movies/critics. This is done when getting item-based recommendations, on page 24.
Am I correct? (If I'm not, please excuse me for this Sunday morning observation *yawn* )

Koen Mannaerts 
Printed Page 13
sim_pearson code fragment

Definition of sim_pearson is subject to integer math errors, at least in Python 2.5.1. The number of overlapping elements is saved as "n = len(si)". This is used later to determine the numerator of the formula.

It turns out this can allow the function to return values > 1, for this reason:

13 * 11 / 3 ==> 47

whereas:

13 * 11 / 3.0 ==> 47.666666666666664


Changing the initial line to be "n = float(len(si))" is enough to prevent this.


As someone in a separate errata noted, the function should also return 1.0 if the denominator is 0.

Anonymous 
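The truncation described in the sim_pearson report above can be reproduced in Python 3, where the // operator behaves like Python 2's integer division. This is only an illustration of the arithmetic, not the sim_pearson function itself; the variable names are hypothetical.

```python
n_int = 3                # n stored as an integer, as with len(si)
n_float = float(n_int)   # n stored as a float, as the report suggests

# Floor division in Python 3 mimics "/" on two integers in Python 2.
truncated = 13 * 11 // n_int

# Dividing by a float keeps the fractional part.
exact = 13 * 11 / n_float

print(truncated, exact)
```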
PDF Page 13
in the sample code

# if they are no ratings in common, return 0
if n==0: return 0

must be:
# if they are no ratings in common, return 0
if n==0: return -1

because sim_pearson returns -1 if no match is found, not 0. Zero means roughly a 50/50 match, and this will give false results in all further functions that use it, such as topMatch on page [14].




Anonymous 
Printed Page 14
first paragraph of "Ranking the Critics"

"learning which movie critics have tastes simliar to mine"
^^^^^^^

Anonymous 
Printed Page 15
Table 2-2

FinalScore = Total/Sim.Sum

This will remove (or reduce) the weight of similarity.

Let's imagine this scenario, where I am X:
A, B, and C are very similar to X; let's say each has 0.9 similarity to X.
D, E, and F are quite different from X; let's say each has 0.1 similarity to X.

A,B,C all rated Movie1 4.5
D,E,F all rated Movie2 4.6

So for Movie1 we get (4.5*0.9+4.5*0.9+4.5*0.9)/(0.9+0.9+0.9)=4.5
for Movie2, we get 4.6 which > Movie1

But I think this is an improper recommendation, as X should listen more to A, B, and C, and they recommend Movie1.

I think it may be better to calculate FinalScore = Total/sqrt(Sim.Sum)
With this, we can also get consistent recommendations for the two examples in this book (Table 2-2 on page 15, and Table 2-3 on page 24)

Thanks,

Fuchen Ying 
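The scenario in the Table 2-2 report above can be worked through numerically. The sketch below, in Python 3, compares the normalisation described in the report (Total/Sim.Sum) with the submitter's proposal (Total/sqrt(Sim.Sum)); the function name final_score and the data layout are assumptions for illustration only.

```python
from math import sqrt

def final_score(ratings_and_sims, normalise):
    """Similarity-weighted total, divided by normalise(sum of similarities)."""
    total = sum(rating * sim for rating, sim in ratings_and_sims)
    sim_sum = sum(sim for _, sim in ratings_and_sims)
    return total / normalise(sim_sum)

# Three similar critics rate Movie1; three dissimilar critics rate Movie2.
movie1 = [(4.5, 0.9)] * 3
movie2 = [(4.6, 0.1)] * 3

# Total / Sim.Sum
plain = (final_score(movie1, lambda s: s), final_score(movie2, lambda s: s))

# Total / sqrt(Sim.Sum), as proposed in the report
rooted = (final_score(movie1, sqrt), final_score(movie2, sqrt))

print(plain, rooted)
```

With plain normalisation Movie2 scores slightly higher than Movie1; with the square-root variant the order is reversed.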
Printed Page 20

In the pydelicious.py module, an exception is raised when "feedparser" is imported.

Anonymous 
Printed Page 21
1st piece of code

I think the API might have changed and doesn't include the 'href' key anymore.

When I ran the sample code, it choked on this line in the first piece of code on the page:

for p2 in get_urlposts( p1[ 'href' ] )

The Python interpreter complained about the 'href' key:

>>> delusers = initializeUserDict( 'programming' )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "deliciousrec.py", line 8, in initializeUserDict
    # find all users who posted this
KeyError: 'href'


Lina Faller 
Printed Page 31
2nd code sample

The code to split text into words uses this regular expression:

[^A-Z^a-z]

the above includes caret ('^') in the set with the alphabetic characters, so caret will not be used to split words. For example:

>>> re.compile("[^A-Z^a-z]").split("foo^bar")
['foo^bar']


The correct regex is [^A-Za-z]:

>>> re.compile("[^A-Za-z]").split("foo^bar")
['foo', 'bar']

Anonymous 
Printed Page 50
Very bottom of page

The variable "outersum" is defined but never used.

Anonymous 
Printed Page 62
2nd paragraph (suggested), just after def getentryid

The first edition of the book does not include the working definition for the addlinkref function. That is, as printed, the function only contains the pass statement, and unfortunately is never updated. This causes the code in the Inbound Links section to fail (particularly the PageRank algorithm). I suggest that the completed code be placed on p. 62, just after the getentryid definition. [The full function definition does appear in the PCI_code.zip file available via the Examples link, however.]

Marisano James 
Printed Page 68
1st code segment of the "Word Distance" section; 3rd from last line

The line

dist=sum([abs(row[i]-row[i-1]) for i in range(2,len(row))])

contains an error that renders the code substantially less useful - often almost useless. Recall that Python indexing begins at zero. Thus for the line to work appropriately, it should read:

dist=sum([abs(row[i]-row[i-1]) for i in range(1,len(row))])

Note that the lower bound of the range has been decremented by one.

Summing across the word distances is correct, however (as explained in the text), unlike what was anonymously reported as an error on p. 69 below.

Marisano James 
Printed Page 69
1st code segment (4th line down)


dist = sum([abs(row[i] - row[i-1]) for i in range(2, len(row))])

This makes no sense. Presumably sum should be min.

This is a lovely book, but I agree with an earlier submitter: it is very poorly proofread. This is especially a problem as the code isn't commented. Even if the code were error free, it is quite terse and hard for a non-Python expert to read. The errors really detract from readability.

Anonymous 
Printed Page 70
Figure 4-3

Page C has only three links to other pages. It should have four links to other pages.

Jarno Mielikainen 
Printed Page 70
Figure 4-3

The text states "C links to four other pages", but in the figure there are only 3 additional arrows besides the one pointing to A. In the computation the total number of links on C is 5. Just an inconsistency.

Anonymous 
Printed Page 70
Last Paragraph

Looking at Figure 4-3, Page B and page C both have three links going to pages other than A. The text in the final paragraph states "B also has links to three other pages and C links to four other pages". Essentially, it looks like there should be an extra arrow coming out of C in the diagram to bring the total number of arrows to four.

Bryce Thomas 
Printed Page 71
Beginning of code section, calculatepagerank

To be most effective, a call to self.dbcommit() should be inserted after the self.con.execute('drop table if exists pagerank') call (i.e. just after the first command of the calculatepagerank function). This permits calculatepagerank to be called multiple times in a single python session without the database returning an error stating that it already contains a pagerank table, and subsequently remaining locked due to the error.

Marisano James 
PDF Page 81
last lines of last two paragraphs of the codes

The formulas to calculate output_deltas and hidden_deltas are wrong. According to the Delta Rule (http://en.wikipedia.org/wiki/Delta_rule), either these formulas should be error divided by dtanh, or the dtanh itself should be 1/(1-y*y).

Eric Wang 
Printed Page 88
Code definition for printschedule method

for d in range(0, len(r), 2):
should be
for d in range(0, len(people)):
Iterating over the schedule is incorrect; the list of people should be iterated over instead. This is why the out and ret variables are retrieved via r[2*d] and r[2*d+1], respectively. As is, this causes an index out of bounds exception since the length of the schedule is twice the number of people.

Jakob Homan 
Printed Page 88,90
Throughout schedulecost and printschedule method definitions

All instances of determining the return flight (returnf and ret variables) should have their destination and origin assignments switched, in order to find a flight back from LGA. As printed, the code finds the same flight for both origin and return.

Jakob Homan 
Printed Page 88
bottom of page

The index for the line beginning with out= should be [int(r[2*d])] rather than [r[d]], and the index for the line beginning with ret= should be [int(r[2*d+1])] rather than [r[d+1]].

There is no need to change the range of the for loop in the printschedule function, however, or to reverse any of the destination and origin assignments, as was erroneously reported elsewhere in this errata list.

Anonymous 
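The indexing discussed in the page 88 reports above can be illustrated with a small example. This is a sketch in Python 3 under the assumption that the solution list stores one (outbound, return) pair per person, back to back; the names people, sol, paired, and adjacent are hypothetical and the data is made up.

```python
# Hypothetical example: three travellers, so the solution has six entries,
# laid out as [out_0, ret_0, out_1, ret_1, out_2, ret_2].
people = ["A", "B", "C"]
sol = [1, 4, 3, 2, 7, 3]

def paired(sol, n_people):
    """Read (outbound, return) for person d from slots 2*d and 2*d+1."""
    return [(int(sol[2 * d]), int(sol[2 * d + 1])) for d in range(n_people)]

def adjacent(sol, n_people):
    """Read slots d and d+1, the indexing the report objects to."""
    return [(int(sol[d]), int(sol[d + 1])) for d in range(n_people)]

print(paired(sol, len(people)))
print(adjacent(sol, len(people)))
```

With the adjacent indexing, the second and third travellers reuse slots that belong to earlier travellers, and the last slots are never read.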
Printed Page 89
code example output

There have been previous reports of other issues with this example, e.g. the airport name appears nowhere in schedule.txt and is not programmed; also, the discussion and code are a bit inconsistent. Now, when running the example code, the output is very different from the printout in the book as well. In summary, the illustration here is very confusing.

Anonymous 
PDF Page 90
the last 3 and 4 lines

int(sol[d]) should be int(sol[2*d])
int(sol[d+1]) should be int(sol[2*d+1])

and therefore the calculated result on the next page should be 5469

Eric Wang 
Printed Page 90
3rd paragraph, first sentence

The text states that, "There are a huge number of possibilities for the getcost function defined here." There is no getcost function, however; "getcost" should be replaced with "schedulecost".

Marisano James 
Printed Page 91
3rd paragraph

In theory, you could try every possible combination, but in this example there are 10 flights, all with 6 possibilities, giving a total of 6^10 combinations.

instead of

In theory, you could try every possible combination, but in this example there are 16 flights, all with 9 possibilities, giving a total of 9^16 combinations.

because 6 family members each have 10 flights to LGA

Anonymous 
Printed Page 91
3rd paragraph

I believe that calculation should be:

(10 flights to LGA * 10 flights from LGA) ^ (# of family members) =
(10^2)^6 = 10^12

or in Python,
(10**2)**6 = 10**12

Anonymous 
Printed Page 91
3rd paragraph

Oh, and 9**16 is much closer to 300 trillion than it is to 300 billion

Anonymous 
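The competing counts in the page 91 reports above are easy to evaluate exactly, since Python integers have arbitrary precision. The snippet below (Python 3) only computes the three expressions mentioned by the submitters; it does not decide which one describes the problem correctly.

```python
# Each key is an expression taken from the reports above.
candidates = {
    "9**16": 9 ** 16,
    "6**10": 6 ** 10,
    "(10**2)**6": (10 ** 2) ** 6,
}

for label, value in candidates.items():
    # The "," format specifier groups digits in threes for readability.
    print(f"{label} = {value:,}")
```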
Printed Page 92
1st code snippet

> return r
should be
> return bestr

Anonymous 
Printed Page 92
first code snippet

The randomoptimize function should return bestr, not r.

Andy Young 
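The point of the two page 92 reports above is that a random search should return the best candidate seen, not the last one generated. The following is a minimal Python 3 sketch of that idea, not the book's listing; the function name random_optimize, the iterations and seed parameters, and the inclusive treatment of the domain bounds are assumptions.

```python
import random

def random_optimize(domain, costf, iterations=1000, seed=None):
    """Try random candidates and return the cheapest one seen."""
    rng = random.Random(seed)
    best_cost = float("inf")
    best_r = None
    for _ in range(iterations):
        # One random value per variable, bounds assumed inclusive.
        r = [rng.randint(lo, hi) for lo, hi in domain]
        cost = costf(r)
        if cost < best_cost:
            best_cost, best_r = cost, r
    # Return the best candidate, not the most recent one.
    return best_r

print(random_optimize([(0, 9)] * 4, sum, iterations=200, seed=1))
```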
Printed Page 92
2nd line of Python session section (toward top of page)

The domain goes from 0-10, i.e. 0..9 rather than 0..7. (See schedule.txt for confirmation; also on p. 97, 8 could not be included as part of the solution if the highest available value were 7.) This means the line specifying the domain should read:

domain=[(0,10)]*(len(optimization.people)*2)

Marisano James 
Printed Page 93
hillclimb function code sample

The spelling of neighbors is inconsistent in the code sample. All instances of the word neighbors/neighbours in the code sample should be changed either to "neighbors" or "neighbours", but not a mix of both.

Bryce Thomas 
Printed Page 93
bottom third of page, about halfway through the code section

The code as printed in the first edition of the book, allows for negative index references (in Python these will not cause index out of range errors, but will instead read from the opposite end of the list) and does not explore the full domain. One way to address these shortcomings is to change the domain to take on actual values from 0 up to 9 inclusive, and to make the following changes:

if sol[j]>domain[j][0]:

should be changed to read,

if sol[j] > domain[j][0] and sol[j] < domain[j][1] - 1:


Likewise,

if sol[j]<domain[j][1]:

should read,

if sol[j] < domain[j][1] and sol[j] > domain[j][0]:


Another solution is to adjust the range as stated, but then to enforce a wrap-around at the ends of the range (this approach allows the algorithm to test all of the neighboring schedules that comprise the local minimum),

for j in range(len(domain)):
  # One way in each direction.
  #
  # Test modified index against the upper bound.
  if sol[j] > domain[j][0]:
    if sol[j]+1 < domain[j][1] - 1: value = sol[j]+1
    else: value = domain[j][0]
    neighbors.append(sol[0:j] + [value] + sol[j+1:])

  # The lower bound.
  if sol[j] < domain[j][1]:
    # Use the mod function to wrap around.
    neighbors.append(sol[0:j] + [(sol[j]-1)%domain[j][1]] + sol[j+1:])

Marisano James 
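A simpler alternative to the variants in the page 93 report above is to generate only neighbours that stay inside the domain, without wrap-around. This Python 3 sketch is not taken from the book or from the report; it assumes each domain entry is an inclusive (low, high) pair, and the function name neighbors_of is hypothetical.

```python
def neighbors_of(sol, domain):
    """Return all solutions that differ from sol by one step in one position."""
    result = []
    for j, (low, high) in enumerate(domain):
        # Step down only if that stays at or above the lower bound.
        if sol[j] > low:
            result.append(sol[:j] + [sol[j] - 1] + sol[j + 1:])
        # Step up only if that stays at or below the upper bound.
        if sol[j] < high:
            result.append(sol[:j] + [sol[j] + 1] + sol[j + 1:])
    return result

print(neighbors_of([0, 9], [(0, 9), (0, 9)]))
```

Because no candidate leaves the range, negative indexes cannot occur when the values are later used to look up flights.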
Printed Page 98
mutate function definition

The mutate function is missing an else clause (or some other mechanism for returning a default value). A possible solution would be to remove the elif conditional and guarantee that the function always returns a value:

# Mutate operation
def mutate(vec):
  i=random.randint(0,len(domain)-1)
  if random.random()<0.5 and vec[i]>domain[i][0]:
    return vec[0:i]+[vec[i]-step]+vec[i+1:]
  else:
    return vec[0:i]+[vec[i]+step]+vec[i+1:]

Jeremy Mason 
Printed Page 98
code example in function def mutate

The mutate function returns None if neither the if nor the elif is true. This results in None being appended to the pop list, which causes badness in the cost function.

suggest adding the following to the end of the mutate function

else:
  return vec

Matt Mercer 
PDF Page 98
the mutate function of code example

The mutate function should have an else clause that returns vec. Otherwise None will occasionally be added to pop.

Eric Wang 
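The three page 98 reports above agree that mutate must never return None. The Python 3 sketch below combines their suggestions: it keeps both bounded branches and falls back to an unchanged copy. It is an illustration rather than the book's function; passing domain, step, and rng as parameters and treating the bounds as inclusive are assumptions.

```python
import random

def mutate(vec, domain, step=1, rng=random):
    """Nudge one position down or up, always returning a list."""
    i = rng.randint(0, len(domain) - 1)
    if rng.random() < 0.5 and vec[i] - step >= domain[i][0]:
        return vec[:i] + [vec[i] - step] + vec[i + 1:]
    elif vec[i] + step <= domain[i][1]:
        return vec[:i] + [vec[i] + step] + vec[i + 1:]
    # Neither move is possible here, so return an unchanged copy.
    return vec[:]

print(mutate([0, 9, 0, 9], [(0, 9)] * 4, rng=random.Random(0)))
```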
Printed Page 108
second code example

for i in range(len(dorms): slots += [i,i]

should be:

for i in range(len(dorms)): slots += [i,i]

Anonymous 
Printed Page 109
bottom

The output of dorm.printsolution(s) uses the value of s as generated by s=optimize.randomoptimize(), for a solution cost of 18 (the original generated solution.)

optimize.geneticoptimize() is then called, but the results are not reassigned to s, and the output is confusing.

It might be clearer if this session were changed to:

reload(dorm)
s=optimize.randomoptimize(dorm.domain,dorm.dormcost)
dorm.printsolution(s)
dorm.dormcost(s)
r=optimize.geneticoptimize(dorm.domain,dorm.dormcost)
dorm.printsolution(r)


so that the before and after can be seen.

Anonymous 
Printed Page 121
2nd sample code block

After importing docclass, the db should be initialized.

cl.setdb('test.db')

Or this line can be added to docclass.py

Anonymous 
Printed Page 123
1st paragraph

The function for the calculation of the weighted probabilities (weightedprob) is wrong. In more detail, the variable "total" is calculated in a wrong way. "total" is a weighting factor for "basicprob" (probability to find a document with a given feature in a given category). In the book "total" is calculated as "the number of times this feature has appeared in all categories". However, the weighting factor "total" should be equal to the number of items in a considered category.

Anonymous 
Printed Page 129
very last line of code

I believe the last argument to the invchi2 function should *not* be multiplied by 2. That is, it should be:

return self.invchi2(fscore, len(features))

rather than:

return self.invchi2(fscore, len(features)*2)

Roy Pardee 
Printed Page 150
buildtree function

The recursive calls at the end of the buildtree function should propagate the scoref parameter. Otherwise if you use a scoref function besides the default "entropy" function it will only be used on the first call.

So instead of

trueBranch=buildtree(best_sets[0])

it should be

trueBranch=buildtree(best_sets[0], scoref)

Stan Dyck 
Printed Page 157
bottom of page, second last line of code

The mdclassify function is called qualified by treepredict2, but treepredict2 is neither imported (it doesn't exist anyway) nor defined anywhere. Changing

treepredict2.mdclassify(['google', 'France',None,None],tree)

to simply

treepredict.mdclassify(['google', 'France',None,None],tree)

appears to generate the intended results.

Bryce Thomas 
Printed Page 160
getaddressdata function code sample top half of page

It appears as though the Zillow API does not return "totalRooms" anymore (assuming it once did). Furthermore, some of the houses that it searches for appear to return no actual values from Zillow, or still cause the exception block to be hit. Later on, when the code asks for len(row) in the variance method of treepredict.py, this will cause "TypeError: object of type 'NoneType' has no len()".

As a workaround:

rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data

should be commented out. Also, in getpricelist() function, change

l1.append(data)

to

if data != None:
  l1.append(data)


this should prevent null rows getting added and therefore prevent the TypeError exception described above.

Bryce Thomas 
Printed Page 161
Modelling "Hotness"

Hot or Not API is no longer available, so Hot or Not stuff will not work.

Bryce Thomas 
Printed Page 175
top half of page, gaussian function and interactive interpreter sample

The gaussian function shown at the top of the page does not produce the results shown in the interactive interpreter sample code.

E.g. numpredict.gaussian(1.0) produces 0.99501247... not 0.606530659... as shown.

To get results which match that of the sample interactive interpreter, change the gaussian function to:

def gaussian(dist,sigma=10.0):
  return math.e**(-(dist*10)**2/(2*sigma**2))

This way, results match what's shown on the page and are then also consistent with what's illustrated in figure 8-5.

Bryce Thomas 
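The values quoted in the page 175 report above can be checked directly. The Python 3 sketch below uses the gaussian formula as quoted in these reports, with sigma made an explicit argument so the variants can be compared side by side; the variable names are illustrative.

```python
import math

def gaussian(dist, sigma):
    """Gaussian weight as quoted in the reports, with sigma made explicit."""
    return math.e ** (-dist ** 2 / (2 * sigma ** 2))

# Function as printed, with sigma=10.0.
printed = gaussian(1.0, sigma=10.0)

# Same function with sigma=1.0, as proposed in the Section 8.3.3 report.
sigma_one = gaussian(1.0, sigma=1.0)

# Distance multiplied by 10 with sigma=10.0, as proposed in the page 175 report.
rescaled = gaussian(1.0 * 10, sigma=10.0)

print(printed, sigma_one, rescaled)
```

The last two variants give the same value, so either change reproduces the 0.6065... shown in the interpreter sample.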
Printed Page 186
code example

numpredict.cumulativegraph(data, (1,1), 6)

the "high" price 6 doesn't make sense.
according to the figure 8-10, the high should be at least 120.

Anonymous 
Printed Page 186
interactive interpreter code sample

the cumulativegraph function is being passed the vector (1,1), which means it will draw the cumulative probability for the price of bottles of wine which have a rating of 1 and are 1 year old. These wine bottles would be so terrible that the cumulative probability would reach 1 at a price of 0, meaning there's nothing to see on the graph generated.

A better example would be to use say a wine bottle rating of 99 that's 10 years old, with a call like:

numpredict.cumulativegraph(data,(99,10),120)

This way, there'll actually be something useful displayed on the graph when it's printed and it should look more like Figure 8-10 does.

Bryce Thomas 
Printed Page 188
interactive interpreter code sample

Like the error on page 186, using the vector (1,1) - a wine bottle with a rating of 1 and 1 year old, means that the wine would be so bad that it would all cost essentially 0. Instead, use a vector like say (99,10).

Also, the value of 6 used for high is quite small. To get something that looks similar to figure 8-11, the value 120 should be used instead. So, the line which reads:

numpredict.probabilitygraph(data,(1,1),6)

should become something like

numpredict.probabilitygraph(data,(99,10),120)

Bryce Thomas 
Printed Page 198
First paragraph, 2nd to last sentence

"This data is used ...", should probably read, "These data are used ..." [Data is the plural form of datum.]

Anonymous 
Printed Page 199
code sample

The paragraph after the code sample states "The points will be O if the people are a match and X if they are not." The code sample however draws a scatterplot where people that are a match are represented with a green O and people that are not a match are represented with a red O (not an X). To get a scatterplot that uses red X's instead of red O's, change the line:

plot(xdn,ydn,'ro')

to

plot(xdn,ydn,'rx')

Alternatively, to get a scatterplot that looks like that in figure 9-1 (which uses neither O's nor X's), change the same line:

plot(xdn,ydn,'ro')

to

plot(xdn,ydn,'r+')

Bryce Thomas 
Printed Page 199
middle of page (code section)

If the plotagematches function is really to be called from one's Python session (rather than being added to the advancedclassify.py file) then one does not need to reload advancedclassify, and should only type "plotagematches(agesonly)" instead of "advancedclassify.plotagematches(agesonly)" into the Python session.

Marisano James 
Printed Page 204
First Paragraph

There is a sentence which states "There are two other points, X0 and X1, which are examples that have to be classified". The diagram on the other hand uses points X1 and X2, but no X0. I believe that the sentence should be changed to "There are two other points, X1 and X2, which are examples that have to be classified."

Bryce Thomas 
Printed Page 208
top of page, 2nd line of getlocation function body

The URL for Yahoo! Maps Web Services has changed. The line reading,

'http:\\api.local.yahoo.com/MapServices/V1/'+\

should now read,

'http://local.yahooapis.com/MapsService/V1/'+\

Marisano James 
Printed Page 209
halfway down the page

Apparently there are two 824 3rd Avenues in New York City: one that's approximately 0.9 miles from 220 W 42nd St, and another that's about 6.6 miles away. It might be good to include zip codes with some, or all, of the addresses from the matchmaker.csv file to avoid confusion. [The new Yahoo! Maps Services returns the latitude and longitude of the second address (the one about 6.6 miles away) by default, rather than those of the address approximately 0.9 miles away, as listed in the book.]

Marisano James 
Printed Page 210
In the scaledata function

In the book, scaleinput should be defined as:

# Create a function that scales data
def scaleinput(d):
    return [(d[i] - low[i]) / (high[i] - low[i]) for i in range(len(low))]

In the printed copy, the return statement tries to access d.data[i], even though the data member was passed in to scaleinput, in both scaledata and in the interpreter example below the code.
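
As a minimal, self-contained sketch of the corrected behavior: the helper name make_scaler and the sample vectors below are made up for illustration and are not from the book; only the body of scaleinput follows the correction above.

```python
def make_scaler(rows):
    """Return a function that min-max scales a vector using per-column bounds from rows."""
    low = [min(col) for col in zip(*rows)]
    high = [max(col) for col in zip(*rows)]

    def scaleinput(d):
        # d is a plain list of numbers, not an object with a .data attribute
        return [(d[i] - low[i]) / (high[i] - low[i]) for i in range(len(low))]

    return scaleinput


rows = [[20.0, 0.0], [30.0, 5.0], [40.0, 10.0]]  # made-up sample vectors
scaleinput = make_scaler(rows)
scaled = scaleinput([25.0, 10.0])  # each value is mapped into the 0..1 range
```

Note that a column whose low and high values are equal would cause a division by zero in this sketch.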

Anonymous 
Printed Page 213
top

The printed first edition does not include the source code for the veclength function,

def veclength(v):
    return sum([p**2 for p in v])

Perhaps it should be placed toward the top of page 213. [The function does appear in the file PCI_code.zip, available via the Examples link <http://examples.oreilly.com/9780596529321/>.]

Marisano James 
Printed Page 213
top

Just above the definition of the rbf function there should be an

import math

statement to ensure that the constant math.e is defined.

Marisano James 
Printed Page 213
top of page, in rbf definition

To be in accordance with the other gammas, and with the code listed in PCI_code.zip, the first line of the rbf function definition should read,

def rbf(v1, v2, gamma=10):

For the purposes of the book, I believe that the gamma parameter is always explicitly passed, however.
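
Putting the three page 213 corrections together, a sketch of what the definitions might look like follows. The body of rbf is an assumption (a standard radial basis function built on veclength and math.e), since this errata list quotes only its first line.

```python
import math


def veclength(v):
    # Sum of squared components (the squared length; no square root is taken)
    return sum([p ** 2 for p in v])


def rbf(v1, v2, gamma=10):
    # Assumed body: e ** (-gamma * squared distance between v1 and v2)
    dv = [v1[i] - v2[i] for i in range(len(v1))]
    return math.e ** (-gamma * veclength(dv))


same = rbf([1, 1], [1, 1])        # identical points give the maximum value
near = rbf([1, 1], [1.1, 1], 10)  # a close point scores just below that
far = rbf([1, 1], [2, 1], 10)     # a distant point scores near zero
```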

Marisano James 
Printed Page 213
bottom third of page (2nd from last line of the nlclassify function)

The line,

if y<0: return 0

should read,

if y>0: return 0

This is in accordance with the code included in PCI_code.zip and also allows the offset used on page 214 to be meaningfully calculated. (See below.) Without this change, each result returned by nlclassify will be the opposite of its intended value.
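
To see why the comparison has to be y>0, here is a rough sketch under stated assumptions: rows are plain (data, match) pairs rather than the book's row objects, the kernel is the assumed radial basis function e ** (-gamma * squared distance), and y is the average similarity to class 0 minus the average similarity to class 1, plus the offset. The sample data is made up.

```python
import math


def rbf(v1, v2, gamma=10):
    # Assumed kernel: e ** (-gamma * squared distance)
    return math.e ** (-gamma * sum((a - b) ** 2 for a, b in zip(v1, v2)))


def nlclassify(point, rows, offset, gamma=10):
    # rows is a list of (data, match) pairs here; the book uses row objects
    sum0 = sum1 = 0.0
    count0 = count1 = 0
    for data, match in rows:
        if match == 0:
            sum0 += rbf(point, data, gamma)
            count0 += 1
        else:
            sum1 += rbf(point, data, gamma)
            count1 += 1
    # y is positive when the point is, on average, more similar to class 0
    y = sum0 / count0 - sum1 / count1 + offset
    if y > 0:
        return 0
    return 1


rows = [([0.0, 0.0], 0), ([0.1, 0.0], 0), ([1.0, 1.0], 1), ([0.9, 1.0], 1)]
near_class0 = nlclassify([0.05, 0.0], rows, offset=0.0)
near_class1 = nlclassify([0.95, 1.0], rows, offset=0.0)
```

With the printed y<0 test, both of these results would be flipped.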

Marisano James 
Printed Page 214
Interactive interpreter sample at top of page

Before you can execute any of the statements, you first need to reload advancedclassify using:

reload(advancedclassify)

The book also never defines the variable "offset" which is being passed into the nlclassify function. I'm not sure what a reasonable value for this would be.

Bryce Thomas 
Printed Page 214
top of page, just before the first call to nlclassify

Just after the missing reload(advancedclassify) statement (See Bryce's comment for more), one should execute

offset=advancedclassify.getoffset(agesonly)

to properly define the otherwise undefined offset. This should produce a value of approximately -0.0076450020098023288.
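
This errata list does not show the definition of getoffset, so the following is only a sketch of how such an offset could be computed, assuming it is the difference between the average within-class kernel values of the two classes. The (data, match) row format and the sample data are made up, and the sketch will not reproduce the value quoted above.

```python
import math


def rbf(v1, v2, gamma=10):
    # Assumed kernel: e ** (-gamma * squared distance)
    return math.e ** (-gamma * sum((a - b) ** 2 for a, b in zip(v1, v2)))


def getoffset(rows, gamma=10):
    # rows is a list of (data, match) pairs here; the book uses row objects
    l0 = [data for data, match in rows if match == 0]
    l1 = [data for data, match in rows if match != 0]
    # Sum the kernel over every ordered pair of points within each class
    sum0 = sum(rbf(v1, v2, gamma) for v1 in l0 for v2 in l0)
    sum1 = sum(rbf(v1, v2, gamma) for v1 in l1 for v2 in l1)
    # Difference of the two within-class averages
    return sum1 / len(l1) ** 2 - sum0 / len(l0) ** 2


# Symmetric made-up data: both classes are equally tight, so the offset is close to zero
rows = [([0.0, 0.0], 0), ([0.1, 0.0], 0), ([1.0, 1.0], 1), ([1.1, 1.0], 1)]
offset = getoffset(rows)
```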

Marisano James 
Printed Page 218
top

The lines reading,

m.save(test.model)
m=svm_model(test.model)

should read,

m.save('test.model')
m=svm_model('test.model')

Also, in order for the svmc.pyd file to work with your Python environment, it must be of the correct version. The svmc.pyd file included in PCI_code.zip is for Python 2.4, the one available from the LIBSVM website (as of November 3, 2009) is for Python 2.6. I used this website to get a version that works with Python 2.5: <http://www.cs.sunysb.edu/~algorith/implement/libsvm/distrib/>. [It's also possible to build the svmc.pyd with C.]

Marisano James 
Printed Page 219
Facebook section

The Facebook API has changed dramatically since publication, and the book's code no longer functions. Running the book's code returns a "This API version is deprecated" error:

<?xml version="1.0" encoding="UTF-8"?>
<error_response xmlns="http://api.facebook.com/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://api.facebook.com/1.0/ http://api.facebook.com/1.0/facebook.xsd">
<error_code>12</error_code>
<error_msg>This API version is deprecated</error_msg>
<request_args list="true">
<arg>

Jeffery Shipman 
Printed Page 230
makematrix() function

Non-negative matrix factorization is well defined only when there are no all-zero rows or columns. By keeping only "words that are common but not too common," the makematrix() function can eliminate all the words in some documents, yielding all-zero rows. The evidence that this has occurred is error messages and NaNs in the output. It can be fixed in two ways. A quick-and-dirty fix is to eliminate fewer words in makematrix(); this can be accomplished with simple changes to the if statement. A more robust fix is to remove the resulting all-zero rows.
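
One way the more robust fix might look is sketched below. The function name drop_zero_rows and the sample data are hypothetical; the sketch assumes the matrix is a list of per-article word-count rows with a parallel list of article titles that must stay in sync.

```python
def drop_zero_rows(matrix, labels):
    """Return copies of matrix and labels without the all-zero rows."""
    kept = [(row, label) for row, label in zip(matrix, labels) if any(row)]
    return [row for row, _ in kept], [label for _, label in kept]


# Made-up word counts: the second article lost all of its words to filtering
counts = [[3, 0, 1], [0, 0, 0], [0, 2, 0]]
titles = ["article A", "article B", "article C"]
counts, titles = drop_zero_rows(counts, titles)
```

Because the removed rows contain only zeros, dropping them cannot turn a non-zero column into an all-zero one.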

Scott Ainsworth 
Printed Page 261
3rd line from bottom of page

isinstance(t, node):

should read

hasattr(t, 'children'):

The former does not function correctly (under Python 2.5.2 at any rate).
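
The erratum does not say why the isinstance check fails. One plausible cause, offered here only as an assumption, is that reloading the module rebinds the name node to a new class object, so trees built before the reload no longer pass the check. The sketch below simulates that by redefining a stand-in node class.

```python
class node:
    def __init__(self, children):
        self.children = children


old_tree = node([])  # built before the simulated reload


# Simulate a module reload: the name node is rebound to a brand-new class object
class node:
    def __init__(self, children):
        self.children = children


passes_isinstance = isinstance(old_tree, node)   # checks class identity
passes_hasattr = hasattr(old_tree, 'children')   # checks only the attribute
```

The hasattr test depends only on the object carrying a children attribute, so it is unaffected by which class object created it.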

Marisano James 
Printed Page 263
4th line from the bottom of the page

The line

if isinstance(t1, node) and isinstance(t2, node):

should read

if hasattr(t1, 'children') and hasattr(t2, 'children'):

The former does not allow the program to converge to the correct solution. [The latter code snippet also appears in the book's source code as distributed in PCI_Code.zip.]

Marisano James 
Printed Page 271
toward the bottom of the tournament function

As currently presented, when there is a tie between players i and j, it is as if player i lost. Presently the code reads,

elif winner==-1:
    losses[i]+=1
    losses[i]+=1
    pass

However, it should read,

elif winner==-1:
    losses[i]+=1
    losses[j]+=1
    pass

Applying this change (replacing the second losses[i] with losses[j]) will assist in evolving improved game-playing AI. [The above error appears in the online code, PCI_code.zip, as well.]
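
For context, a self-contained sketch of a round-robin tournament with the corrected tie handling follows. The play argument and the toy higher_wins game are hypothetical stand-ins for the book's game function, and the scoring of two points per loss is an assumption.

```python
def tournament(players, play):
    """Round-robin scoring; returns (losses, player) pairs sorted best first.

    play(a, b) is a hypothetical stand-in for the book's game function: it
    returns 0 if a wins, 1 if b wins, and -1 for a tie.
    """
    losses = [0 for _ in players]
    for i in range(len(players)):
        for j in range(len(players)):
            if i == j:
                continue
            winner = play(players[i], players[j])
            # Two points for a loss, one point each for a tie
            if winner == 0:
                losses[j] += 2
            elif winner == 1:
                losses[i] += 2
            elif winner == -1:
                losses[i] += 1
                losses[j] += 1  # corrected: the tie is charged to both players
    return sorted(zip(losses, players))


def higher_wins(a, b):
    # Toy game: the larger number wins, equal numbers tie
    if a == b:
        return -1
    return 0 if a > b else 1


ranking = tournament([3, 1, 3], higher_wins)
```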

Marisano James 
Printed Page 313
9th line

The command should be
$ tar xvf numpy-1.0.2.tar
instead of
$ tar xvf numpy-1.0.2.tar.gz

Anonymous 
