Errata

Errata for Programming Collective Intelligence

Submit your own errata for this product.

The errata list is a list of errors and their corrections that were found after the product was released.

The following errata were submitted by our customers and have not yet been approved or disproved by the author or editor. They solely represent the opinion of the customer.

Color Key: Serious technical mistake Minor technical mistake Language or formatting error Typo Question Note Update

Version	Location	Description	Submitted by	Date submitted
Printed	Page xix 3rd code snippet	print [v10 for v in l1 if if v1>4] should read print [v10 for v in l1 if if v>4]	Anonymous	May 29, 2008
Printed	Page xviii Under List Comprehensions	print [v10 for v in l1 if v1 > 4] should read: print [v10 for v in l1 if v > 4]	Anonymous	Jul 19, 2008
Printed	Page xvii Third example under the List and dictionary constructors heading	The example list is given as: string_list = ['a', 'b', 'c', 'd',] The third example under the List and dictionary heading states the following: string_list[2] # returns 'b' Actually, since [2] is the offset, and list offsets start with [0], string_list[2] should actually return 'c' not 'b' .	Anonymous	Aug 06, 2008
Printed	Page xx line 13	v1 should be v	Anonymous	Nov 26, 2008
Printed	Page xvii 9th Paragraph	Under the Heading "Overview of the Chapters", sub-heading "Chapter 2..." The final clause of the paragraph contains the word "move" which should be "movie".	Anonymous	Dec 01, 2008
Printed	Page xviii List comprehensions	[xviii] Python Tips: List comprehensions; The following statement is NOT valid python code. l1=[1,2,3,4,5,6,7,8,9] print [v10 for v in l1 if v1>4] There is no variable called v1 in this scope. The code should probably look more like this: l1=[1,2,3,4,5,6,7,8,9] print [v10 for v in l1 if v>4] Notice that the if statement was changed.	Ryan	Jan 01, 2009
	NA Section 6.4.1	c1 = classifier.classifier() c1.sampletrain() c1.fprob( 'money', 'bad' ) = 0.5 c1.weightedprob( 'money', 'bad' ) = 0.5 This is the result if we take the examples from his sampletrain(). Yet he lists a hypothetical example where there is only 1 document in the 'bad' category and comes up with these answers: weight = 1, ap = 0.5, total = 1, raw = 1 c1.weightedprob( 'money', 'bad' ) = 0.75 But his sampletrain() will yield: weight = 1, ap = 0.5, total = 1, raw = 0.5 c1.weightedprob( 'money', 'bad' ) = 0.5 If this is true, then his recurring example of c1.weightedprob( 'money', 'bad' ) = 0.75 is very misleading.	Anonymous	Jun 19, 2009
	NA Section 6.6.1	I quote: """ * clf = Pr(feature \| category) for this category * freqsum = Sum of Pr(feature \| category) for all the categories * cprob = clf / (clf+nclf) """ Where is 'nclf' anywhere here? The code itself says: cprob = clf / freqsum I might be mistaken but I am beginning to think that this book has been poorly edited. Some of the code have very bad names for variables, which reduces its readability. The editor probably did not take the time to properly peruse and understand this book, let alone peer review it. It appears to be a case of, "Oh, what is your book called? 'Programming Artificial Intelligence for the Web'? Oh, dear, I'm afraid that is not going to sell. How about 'Programming Collective Intelligence'? I say, that should sell it to the Twitterverse, chaps."	Anonymous	Jun 19, 2009
	NA Section 6.7.1	Persisting the classifier using SQLite is a good idea, but its implementation is terribly naive for the simple reason that primary keys for the tables are not specified. This leads to simply atrocious performance for anything but the most trivial of applications. Makes me think that the author wrote pretty much untested Python code for the book.	Anonymous	Jul 15, 2009
	PCI_code.zip addlinkref function	In the addlinkref function, the call to the separatewords function appears as separateWords rather than in all lowercase, as it defined within the text of the book. Mixing the two 'spellings' causes an error.	Marisano James	Aug 05, 2009
	PCI_code.zip chapter11, gridgame function, about halfway through	# Board wraps Should read: # Board limits as listed in the code on p. 270 of the printed version of the book (1st ed.), and described in the first paragraph on p. 269.	Marisano James	Oct 30, 2009
Printed	Page freemonj@verizon.net freemonj@verizon.net	I wanted to reaffirm the following post: Printed Page 21 1st piece of code I think the API might have changed and doesn't include the 'href' key anymore. When I ran the sample code, it choked on this line in the first piece of code on the page: for p2 in get_urlposts( p1[ 'href' ] ) The Python interpreter complained about the 'href' key: >>> delusers = initializeUserDict( 'programming' ) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "deliciousrec.py", line 8, in initializeUserDict # find all users who posted this KeyError: 'href' Lina Faller Jun 16, 2009	Freemon Johnson	Mar 25, 2011
Printed	Page xvii 5th paragraph	string_list[2] # returns 'b' should be string_list[2] # returns 'c'	Anonymous	May 26, 2011
Printed	Page xviii 3rd paragraph	example code: print [v10 for v in l1 if v1 > 4] should be print [v10 for v in l1 if v > 4]	Anonymous	May 26, 2011
Other Digital Version	chapter6/feedfilter.py Example Code - chapter6/feedfilter.py - Lines 22 and 26	In the example code, there's a bug in 'chapter6/feedfilter.py'. I think you should change the code like this. Line 22: -- print 'Guess: '+str(classifier.classify(entry)) ++ print 'Guess: '+str(classifier.classify(fulltext)) Line 26: -- classifier.train(entry,cl) ++ classifier.train(fulltext,cl) (submitted by a customer via email)	Anonymous	Aug 01, 2011
PDF	Page xvi 2	print [v10 for v in l1 if v1>4] should be print [v10 for v in l1 if v>4]	Kenny Darrell	Oct 09, 2011
Printed, PDF, ePub, Mobi, , Other Digital Version	Page xv Python Tips	I the Programming Collective Intelligence Book, in the Preface in Python Tips section - there is a mistake: ---------------- Python has a good set of primitive types and two that are used heavily throughout this book are list and dictionary. A list is an ordered list of any type of value, and it is constructed with square brackets: number_list=[1,2,3,4] string_list=['a', 'b', 'c', 'd'] mixed_list=['a', 3, 'c', 8] A dictionary is an unordered set of key/value pairs, similar to a hash map in other languages. It is constructed with curly braces: ages={'John':24,'Sarah':28,'Mike':31} The elements of lists and dictionaries can be accessed using square brackets after the list name: string_list[2] # returns 'b' ages['Sarah'] # returns 28 ---------------- In reality it will return c , as in python an array starts from 0. Have a nice day. ? Best regards, PhD Student Pavel Kucherbaev University of Trento, Italy	Adam Zaremba	Aug 01, 2013
Other Digital Version	Fig 2-3 Dupree data point	Error was found in Kindle edition, copyright 2008 The Dupree data point in Figure 2-3 is incorrect. It is shown at [1,2] in the publication; however, according to the values in the "critics" dictionary (both in the book and online), it should be displayed at [3.5, 2.5].	Bill Lampeter	Jan 12, 2015
Other Digital Version	all all references for kiwitobes links	All links pointed to kiwitobes site returning 404 error.!	Tiago Heineck	Oct 05, 2017
	8.1 Building a Sample Dataset	Text says: "The following function generates 200 bottles of wine and calculates their prices from the model." However, the wineset1 code generates 300 bottles of wine. Fix: "The following function generates 300 bottles of wine and calculates their prices from the model."	Anonymous	May 04, 2009
	8.1 Building a Sample Dataset	The wineprice function in text does not match online source. In text: # Past its peak, goes bad in 5 years price=price(5-(age-peak_age)) Online: # Past its peak, goes bad in 10 years price=price(5-(age-peak_age)/2)	Anonymous	May 04, 2009
	8.1 Building a Sample Dataset	The noise calculation in the wineset1 function in the text does not match the online source. In text: "It then randomly adds or subtracts 20 percent..." # Add some noise price=(random()0.4+0.8) Online: # Add some noise price=(random()0.2+0.9)	Anonymous	May 04, 2009
	8.3.3 Gaussian Function definition	The default value for sigma is printed as 10.0, but this should be 1.0 in order to match the graph and the printed results. Wrong: def gaussian(dist,sigma=10.0): return math.e(-dist2/(2sigma2)) Right: def gaussian(dist,sigma=1.0): return math.e(-dist2/(2sigma**2))	Anonymous	May 04, 2009
	8.4 Cross-Validation	Typo in text and difference between crossvalidate function source in text and online source. 1. Typo: "Typically, the test set will be a small portion, perhaps 5 percent of [the] all the data..." 2. Crossvalidate source change: In crossvalidate function source, note that default value of test parameter is 0.5 in the text, but is 0.1 in the online source.	Anonymous	May 06, 2009
	8.5 Heterogeneous Variables	Difference in bottle size between source code and text: In text: "Unlike the variables you've used so far, which were between 0 and 100, its range would be up to 1,500." In source code in text, however, the bottle size ranges up to 3000. (However, in the source code online, the maximum bottle size is 1500.)	Anonymous	May 06, 2009
	8.6 Optimizing the Scale	Two differences between text and online source: 1. In createcostfunction source in text, trails=10, whereas trails=20 in online source. 2. In weightdomain definition in text, 20 is the maximum weight: "...let's restrict them to 20 for now." whereas the maximum weight in online source is 10.	Anonymous	May 06, 2009
	8.6 Optimizing the Scale	The text refers to the geneticoptimize function, but this function is not defined in optimization.py (online). "You can also try the slower but often more accurate geneticoptimize function and see if it returns similar results..." However, there is a swarmoptimize function defined in optimization.py.	Anonymous	May 06, 2009
	8.7 Uneven Distributions	There is no createhiddendataset function but there is a wineset3 function: "The createhiddendataset function creates a dataset that simulates these properties." Should be: "The wineset3 function creates a dataset that simulates these properties."	Anonymous	May 06, 2009
	8.7.2 Graphing the Probabilities	The input vector in the example below is wacky: >>> numpredict.cumulativegraph(data, (1,1), 120) To match the graph in Figure 8-10, try the following: >>> numpredict.cumulativegraph(data,[99,20],120) (Page 186 in printed version?)	Anonymous	May 06, 2009
	8.7.2 Graphing the Probabilities	The input vector and the high value are wacky: >>> numpredict.probabilitygraph(data,(1,1),6) To match Figure 8-11, try the following: >>> numpredict.probabilitygraph(data,[99,20],120) (Page 186?)	Anonymous	May 06, 2009
	8.9 When to Use k-Nearest Neighbors	Typo: s/observation/observations/ "It's also easy to interpret exactly what's happening because you know it's using the weighted value of other observation[s] to make its predictions."	Anonymous	May 07, 2009
Printed	Page 10 after 3rd and 4th paragraph	The examples read: >> from math import sqrt >> sqrt(pow(5-4,2)+pow(4-1,2)) 3.1622776601683795 >> 1/(1+sqrt(pow(5-4,2)+pow(4-1,2))) 0.2402530733520421 but the values are wrong, they should be: >> from math import sqrt >> sqrt(pow(1-2,2)+pow(4.5-4.0,2)) 1.1180339887498949 >> 1/(1+sqrt(pow(1-2,2)+pow(4.5-4.0,2))) 0.47213595499957939	paulo	Aug 26, 2009
Printed	Page 10 code just below 3rd paragraph	Pythagoras' theorem: length^2 = (y2-y1)^2 + (x2-x1)^2 length = sqrt((y2-y1)^2 + (x2-x1)^2) If each x,y pair coordinate is (value of Dupree, value of Snakes) Toby's x,y pair coordinate is (1,4.5) LaSalle's x,y pair coordinate is (2,4) The distance between Toby and LaSalle is: >>> sqrt(pow(4.5-4,2)+pow(1-2,2)) 1.1180339887498949 >>> You can just visually inspect the chart and see the distance between Toby and LaSalle is not even close to 3.16227....	Edward Welling	Mar 15, 2010
PDF	Page 10 Between 3rd and 4th paragraph (code example)	The author is taking the hypotenuse between two (a,b) coordinates. The formula written out is: sqrt(pow(5-4,2)+pow(4-1,2)) Toby's coordinates are (4,5, 1) ~ His score of films "Snakes on a Plane" and "Me and Dupree". Lasalle's corresponding coordinates are (4, 2). Additionally, the difference in each axis should be expressed as an absolute value so: sqrt(pow(fabs(4.5-4),2)+pow(fabs(1-2),2))	Daniel Sikar	Apr 13, 2010
PDF	Page 10 Between 3rd and 4th paragraph (code example)	In addition to my previous submission, the fabs function does not need to be used since the difference is being squared, so my suggested edit: sqrt(pow(4.5-4,2)+pow(1-2,2))	Daniel Sikar	Apr 13, 2010
Printed	Page 11 2nd code block	reload(recommendations) is performed before the recommendations module has ever been loaded.	Anonymous	Jun 20, 2008
Printed	Page 11 last line of code example	The last line of the sim_distance function: "return 1(1+sqrt(sum_of_squares))" should be "return 1(1+sqrt(sum_of_squares))"	Anonymous	Jun 21, 2008
Printed	Page 11 1st example function	The 'confirmed' errata for this mistake was incorrect. http://oreilly.com/catalog/9780596529321/errata/9780596529321.308 shows return 1/(1+sum_of_squares) should be return 1/(1+sqrt(sum_of_squares)) when in fact return 1/(1+sum_of_squares) should be return 1/(1+*sqrt(sum_of_squares)) And as a result of that change, the output of the function when run shown in the next code block on the page should have a result of 0.294298055085549 not 0.148148148148148	Anonymous	Sep 26, 2008
Other Digital Version	11 refer to the online error report 'Changes made in the 3/08 printing'	In the error fix code on your page called: Changes made in the 3/08 printing at this link: http://www.oreilly.com/catalog/9780596529321/errata/9780596529321.308 The error report reads like this: {11} last line of code sample; return 1/(1+sum_of_squares) should be return 1/(1+sqrt(sum_of_squares)) and the last line should be this: return 1/(1+sqrt(sum_of_squares))	Anonymous	Nov 28, 2008
Printed	Page 11 Result of execution of code example (sim_dstance btw 'Lisa Rose' and 'Gene Seymour')	Book gives sim_distance between Lisa and Gene of 0.148148148148. My result was 0.29429805508554946; verified with calculator.	Justin Middleton	Apr 05, 2009
Printed	Page 11 last line of code AND python output	last line of code reads: return 1/(1+sqrt(sum_of_squares)) Should read return 1/(1+sqrt(sum_of_squares)) AND Result is given below as 0.148148148148 This is equivalent to 1/(1+5.75) 5.75 is sum_of_squares NOT sqrt(sum_of_squares) Result should be 0.294298055086	Ian Ford	May 06, 2009
Printed	Page 11 In the 4th paragraph from the bottom on page 11	It seems that the Euclidean distance-based similarity score between 'Lisa Rose' and 'Gene Seymour' should be 0.294298	Daqing Chen	Jul 18, 2009
Printed	Page 11 numerical result (0.148148...)	This is (supposedly) the result of using the function 'sim_distance' as it appears on this page, but it actually uses the statement return 1/(1 + sum_of_squares), rather than the (correct) statement return 1/(1 + sqrt(sum_of_squares)). Apparently, the error is the result of using the wrong version of the function, while the correct version is printed .	Yehiel Milman	Sep 08, 2009
Printed	Page 11 Bottom line of python code section describing the function sim_distance	The sim_distance function should return 1/(1+sqrt(sum_of_squares)) rather than 1/(1+sum_of_squares) when inverting the Euclidean distance. The formula for Euclidean distance includes a square root, but the square root is never taken of the sum of all the squares in this segment of code.	Thea	Sep 10, 2009
Printed	Page 11 Euclidian Distance Score code snippet	In the code snippet for sim_distance: return 1/(1+sqrt(sum_of_squares)) Shouldn't this be return 1/(1+sqrt(sum_of_squares/len(si))) Also, in the example code zip on the companion website, this is return 1/(1+sum_of_squares) which should also be return 1/(1+sqrt(sum_of_squares/len(si))) I guess this error has no effect when getting recommendations for a single movie or critic; the resulting similarity values deviate, but keep the same order. However, this error has serious effect when ordering similarities over movies/critics. This is done in getting recommendations item based, on page 24. Am I correct? (If I'm not, please excuse me for this sunday morning observation yawn )	Koen Mannaerts	Sep 19, 2009
	11 Last code example	The correct similarity score is 0,294298055085 instead of 0.148148148148.	Anonymous	Dec 05, 2009
Printed	Page 11 after second paragraph	The numerical output values of some examples, in chapter 2, including the above seems to be due to an error in the Examples (PCI_Folder), chapter2, recommendations.py, line 38. Wrong: return 1/(1+sum_of_squares) Right : return 1/(1+sqrt(sum_of_squares))	Anonymous	Mar 30, 2010
PDF	Page 11 Second code block	# One would have had to execute and 'import' prior to a 'reload' ... >>> import recommendations >>> reload(recommendations)	Anonymous	Sep 01, 2010
Printed	Page 11 1st code snippet	>>> recommendations.sim_distance(recommendations.critics, 'Lisa Rose', ... 'Gene Seymour') 0.148148148148 should be: >>> recommendations.sim_distance(recommendations.critics, 'Lisa Rose', ... 'Gene Seymour') 0.294298055085 The output is wrong, but the sim_distance function is actually correct. My guess is the author forgot to update his code to reflect the code in the book. This error is also present in the downloadable version of the source code. It can easily be fixed by replacing the return statement in the sim_distance function in recommendations.py. Replace: return 1/(1+sum_of_squares) with: return 1/(1+sqrt(sum_of_squares))	Roshan Patel	Oct 31, 2010
Printed	Page 11 second para	recommendations.sim_distance(recommendations.critics, 'Lisa Rose', 'Gene Seymour') results in a similarity score of: 0.29429805508554946. >>> 1/(1+sqrt(5.75)) #where 5.75 is the 'sum_of_squares' 0.29429805508554946 Not taking the square root of the 'sum_of_squares' will result in the value printed in the book: >>> 1/(1+5.75) 0.14814814814814814	Anonymous	Nov 11, 2010
PDF	Page 11 1st code block	Earlier on in the chapter, the Euclidian Distance score is defined as: 1/(1+sqrt(sum_of_squares)) However, the code example on page 11 leaves out the square root. ---- snip ----- from math import sqrt # Returns a distance-based similarity score for person1 and person2 def sim_distance(prefs,person1,person2): # Get the list of shared_items si={} for item in prefs[person1]: if item in prefs[person2]: si[item]=1 # if they have no ratings in common, return 0 if len(si)==0: return 0 # Add up the squares of all the differences sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2) for item in prefs[person1] if item in prefs[person2]]) return 1/(1+sum_of_squares) ---- end snip ---- The last line of this snippet should be: return 1/(1 + sqrt(sum_of_squares)) In addition, the result in the interpreter output for the code example contains the incorrect value, since the formula was wrong. By the way, great book so far, I'm really enjoying it.	Abraham Lee	Jun 09, 2011
Printed	Page 11 sim_distance function	The square root is not applied in the sim_distance function as is described on the previous page (page 10) in the section titled Euclidean Distance Score. E.g. The last line of the function is: return 1/(1+sum_of_squares) but possibly could/should be: return 1/(1+sqrt(sum_of_squares)) If this is an error/omission it also follows through to the example usage of the function, also on page 11 directly beneath the function code and the example code download accompanying the book. If it is not an error/omission then I think an explanation should be included explaining why the square root is not applied; As it appears to contradict the explanation of how to calculate Euclidean Distance on page 10.	Anonymous	Oct 18, 2011
Other Digital Version	11 end of sim_distance function	It says: "return 1/(1+sqrt(sum_of_squares))" If you run this, you get a different result than stated in the book. The downloaded book code samples say: "return 1/(1+sum_of_squares)" Which gives the proper results, and the right results as shown in the book.	Anonymous	Dec 29, 2011
Printed	Page 11 2nd Para	On page 11, the author computes recommendations.sim_distance(recommendations.critics,'Lisa Rose', 'Gene Seymour') as 0.148148148148. When I programmed his algorithm into Python 2.7.3 as well as implemented it in MS Excel I get 0.294298055086. I can send the author a spreadsheet, as well as my Python implementation if needed.	Salil Athalye	Nov 08, 2012
Printed, PDF	Page 11 Top half of the page	I'm seeing three different versions of the code on this page. Most of them are giving an output of : "Out[3]: 0.29429805508554946" The sample code (which differs from the code given in the book) is the only code that gives the published output of: "0.14814814814814814" Which is correct?	Anonymous	Mar 17, 2013
Printed	Page 11 Euclidean Distance Score code	sum_of_squares in downloadable code (correct) sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2) for item in prefs[person1] if item in prefs[person2]]) sum_of_squares in printed code (incorrect) sum_of_squares = sum([pow(prefs[person1][item]-prefs[person2][item],2) for item in si])	Curie	Jul 01, 2015
Printed	Page 11 Euclidean Distance Score code	sum_of_squares in downloadable code-- yields (0.14814814814814814) sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2) for item in prefs[person1] if item in prefs[person2]]) return 1/(1+sum_squares) sum_of_squares in printed code -- yields (0.29429805508554946) sum_of_squares = sum([pow(prefs[person1][item]-prefs[person2][item],2) for item in si]) return 1/(1+sqrt(sum_squares))	Curie	Jul 01, 2015
Printed	Page 11 12th line in code	If there is nothing found, it is conventional to return -1 meaning not found. Like that it can be filtered out and otherwise it could tamper the results.	Orion	Jul 06, 2015
Printed	Page 11 last line in code	The return statement reads: return 1/(1+sum_of_squares) It should read: return 1/(1+sqrt(sum_of_squares)) Reason: After calculating the sum of each squared value, the square root of it should be calculated. Conclusion: The 'from math import sqrt' is not used and should be used as i have shown in the corrected version. Therefore the given output is also incorrect.	Orion	Jul 06, 2015
Printed	Page 11 python interpreter output	The similarity between Lisa Rose and Gene Seymour should be 0.294298055086 instead of 0.148148148148.	Aleix Conchillo Flaque	Aug 29, 2017
PDF	Page 12 United States	In Figure 2-3 on page 12, Dupree is plotted with a score of 1 from Jack Matthews, which doesn't match the score specified earlier (3.5).	Stephen Dewey	Nov 08, 2013
Printed	Page 13 sim_pearson code fragment	Definition of sim_pearson is subject to integer math errors, at least in Python 2.5.1. The number of overlapping elements is saved as "n = len(is)". This is used later to determine the numerator of the formula. It turns out this can allow the function to return values > 1, for this reason: 13 * 11 / 3 ==> 47 whereas: 13 * 11 / 3.0 ==> 47.666666666666664 Changing the initial line to be "n = float(len(si))" is enough to prevent this. As someone in a separate errata noted, the function should also return 1.0 if the denominator is 0.	Anonymous	Nov 14, 2008
PDF	Page 13 in the sample code	# if they are no ratings in common, return 0 if n==0: return 0 must be: # if they are no ratings in common, return 0 if n==0: return -1 because sim_pearson returns -1 if no match found, not 0 - zero means something about 50/50 matching, and this will give false results in all futher functions what use it, like topMatch e.g. at page [14]	Anonymous	Mar 20, 2009
PDF	Page 13 Code block (3rd or 4th paragraph depending on definition)	There were multiple problems with the code on this page. I answered this question: http://stackoverflow.com/questions/13558529/pearson-algorithm-from-programming-collective-intelligence-still-not-working/13559801#13559801 which provides working code for the example.	Raphael	Nov 25, 2012
PDF	Page 13 Second Half	The Pearson Algorithm doesn't work properly. I have pasted the working code for this section here at http://stackoverflow.com/a/13562198/1828663	Siddharth Menon	Nov 26, 2012
Printed	Page 13 11th line in code (in the comment)	It reads: #if they are no ratings in common, return 0 It should read: #if they have no ratings in common, return ... Reason: It is grammatically incorrect and not consistent with the code for the Euclidean Distance Score on page 11 as there it reads: #if they have no ratings in common, return 0 Conclusion: "are" should be replaced with "have" so it is consistent and grammatically correct	Orion	Jul 06, 2015
Printed	Page 14 first paragraph of "Ranking the Critics"	"learning which movie critics have tastes simliar to mine" ^^^^^^^	Anonymous	Jun 09, 2008
Printed	Page 15 Table 2-2	FinalScore = Total/Sim.Sum This will remove (or reduce) the weight of similarity. Let's imagine this scenario, I'm X A,B,C is very similar to X, let's say ABC has 0.9 similarity to X D,E,F is quite different from X, let's say DEF has 0.1 similarity to X A,B,C all rated Movie1 4.5 D,E,F all rated Movie2 4.6 So for Movie1 we get (4.50.9+4.50.9+4.5*0.9)/(0.9+0.9+0.9)=4.5 for Movie2, we get 4.6 which > Movie1 But I think this is a improper recommendation, as X should listen more to ABC, and ABC recommend Movie1. I think it may be better to calculate FinalScore = Total/sqrt(Sim.Sum) With this, we also can get consistent recommendation for the two examples in this book (Table 2-2 @page 15, and Table 2-3 page 24) Thanks,	Fuchen Ying	Nov 08, 2009
PDF	Page 16 Code	As person is not in the list, the second condition in the original code will never be fulfilled: # only score movies I haven't seen yet if item not in prefs[person] or prefs[person][item]==0: The code may ignore the zero scored items by the "other" person using this code: # only score movies I haven't seen yet that contribute to the recommendation if item not in prefs[person] or prefs[other][item]!=0.0:	ifernando	Jan 06, 2017
Printed	Page 17 2nd output	The output of getRecommendations(recommendations.critics, 'Toby', similarity=recommendations.sim_distance) is wrong. It should be: [(3.457128694491423, 'The Night Listener'), (2.778584003814924, 'Lady in the Water'), (2.422482042361917, 'Just My Luck')] You have fixed the error in page 11(1 / (1+sum_of_squares) -> 1 / (1+sqrt(sum_of_squares))), but didn't change the result calculated by the wrong expression.	Anonymous	Aug 08, 2019
Printed	Page 19 1st paragraph	According to Wikipedia, "On June 1, 2017, Delicious was acquired by Pinboard, and the service will be discontinued in favor of Pinboard's paid subscription-based service."	Anonymous	Sep 17, 2017
Printed	Page 20	In the pydelicious.py module, an exception is raised when "feedparser" is imported.	Anonymous	Oct 28, 2008
PDF	Page 20 5rd paragraph	New del.icio.us API uses "url" field instead of "href" field. All the text referring to "href" field should be updated.	Jingguo Yao	Nov 14, 2010
	20 United States	Upon running the install script for version 0.6, I get 'ImportError: No module named etree.ElementTree' How do you get the del.icio.us API to work??? There is absolutely no documentation for it. The documentation folder is empty and there is no documentation on the code download site. It doesn't even say which version of Python is needed! You should add the version of the API that works for you. "To make things even easier for you, there is a Python API that you can down-load from http://code.google.com/p/pydelicious/source or http://oreilly.com/catalog/9780596529321." I clicked on the book link to oreilly.com but did not find the API code there. I downloaded the book code and did not find it there either!	George Ulmer	Nov 26, 2011
ePub	Page 20 sim_distance result for 'Lisa Rose' and 'Gene Seymour'	The similarity score of 0.148148148148 given for Lisa Rose and Gene Seymour is wrong. It should be 0.29429805508554946.	Bob Felts	Oct 25, 2012
Printed	Page 21 1st piece of code	I think the API might have changed and doesn't include the 'href' key anymore. When I ran the sample code, it choked on this line in the first piece of code on the page: for p2 in get_urlposts( p1[ 'href' ] ) The Python interpreter complained about the 'href' key: >>> delusers = initializeUserDict( 'programming' ) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "deliciousrec.py", line 8, in initializeUserDict # find all users who posted this KeyError: 'href'	Lina Faller	Jun 16, 2009
PDF, Other Digital Version	Page 21 Fourth line	The code: url=post['href'] Is no longer correct and generates a key error. It looks like href has been changed to url. Changing the line to this: url=post['url'] resolves the issue.	wackyvorlon	Mar 27, 2011
Printed	Page 21 top of page	I've read that the 'href' key was renamed to 'url' by the Delicious people, but even when I change the code in line 8 from for p2 in get_urlposts(p1['href']) to: for p2 in get_urlposts(p1['url']) I still got the same error: File "deliciousrec.py", line 8, in initializeUserDict for p2 in get_urlposts(p1['url']): TypeError: string indices must be integers, not str So... not sure where we stand on this, but any help would be greatly appreciated.	Morgan Thompson	May 16, 2013
Printed	Page 25 2nd code sample	The noted mistake on pg 11 with regards to calculating the similarity based on the distance metric propagates to this code sample as well. The numbers are slightly incorrect. They should really be: [(3.1667425234070894, 'The Night Listener'), (2.936629402844435, 'Just My Luck'), (2.868767392626467, 'Lady in the Water')]	radoshi	Oct 21, 2011
Printed	Page 31 2nd code sample	The code to split text into words uses this regular expression: [^A-Z^a-z] the above includes caret ('^') in the set with the alphabetic characters, so caret will not be used to split words. For example: >>> re.compile("[^A-Z^a-z]").split("foo^bar") ['foo^bar'] The correct regex is [^A-Za-z]: >>> re.compile("[^A-Za-z]").split("foo^bar") ['foo', 'bar']	Anonymous	Nov 17, 2009
Printed	Page 31 first code snippet	In the first for loop of the first code snippet: Instead of: for e in d.entries: if 'summary' in e: summary=e.summary	Roshan	Nov 02, 2010
Printed	Page 31 First codeblock	The return d.feed.title,wc will throw an AttributeError if any of the feed urls in feedlist.txt no longer exist. This can be catered for by catching the exception as follows: try: return d.feed.title,wc except AttributeError: return "nofeed",wc	Anonymous	Dec 16, 2010
Printed	Page 32 code listing after 3rd paragraph and another code listing following it	1. In the main loop the first line with function call title,wc=getwordcounts(feedurl) generates an uncaught exception: TypeError: 'NoneType' object is not iterable Adding the try..except fixes the problem: 2. In the second code listing feedlist variable is not defined	Dmitry Kan	Feb 13, 2010
Printed	Page 32 1st code block, 1st line in the for loop	The line feedlist.add(feedurl) throws the following error: AttributeError: 'list' object has no attribute 'add' This can be resolved by replacing the code above by: feedlist.append(feedurl)	Peter Tegelaar	Jan 07, 2011
Printed	Page 32 1st code snippet	feedlist.add(feedurl) add method doesn't exist. should probably read feedlist.append(feedurl)	Anders Jansson	Nov 17, 2012
Printed	Page 32 3rd code snippet	blog = blog.encode(`ascii`, `ignore) it's not single quotes, and the terminating quote on ignore is missing.	Anders Jansson	Nov 17, 2012
Printed	Page 32 7th line of last code block on page	Missing closing single quote for 'ignore' parameter on line that reads: blog = blog.encode('ascii', 'ignore)	Alan Esenther	Dec 20, 2012
Printed	Page 32 Top code snippet for "generatefeedvector.py"	The apcount map is supposed to contain "..the number of blogs each word appeared in", yet the code contains the following statement: for word,count in wc.items(): apcount.setdefault(word,0) if (count>1): apcount[word]+=1 in other words, apcount actually contains the number of blogs each word appeared in _more than once_. This is probably not what the author wants, and I propose the following change to the last two lines of code: if (count>0): apcount[word]+=1	Jens Roland	Mar 08, 2015
Printed	Page 35 Near end of def pearson, about 2/5 down the page	The Pearson correlation coefficient error that anonymous first identified on p. 13 also appears here. The line reading, num=pSum-(sum1sum2/len(v1)) should read num=pSum-(sum1sum2/float(len(v1))) to avoid integer roundoff errors.	Marisano James	Apr 26, 2010
Printed	Page 35 Near end of def pearson, about 2/5 down the page	This Pearson correlation coefficient function contains another slight error: den=sqrt((sum1Sq-pow(sum1,2)/len(v1))(sum2Sq-pow(sum2,2)/len(v1))) should read den=sqrt((sum1Sq-pow(sum1,2)/float(len(v1)))(sum2Sq-pow(sum2,2)/float(len(v1)))) to avoid integer roundoff errors.	Marisano James	Apr 26, 2010
Printed	Page 35 last line of pearson(v1, v2)	The Pearson correlation function on pg. 35 should end in "return num/den", not "return 1.0-num/den". On pg. 13, it ends in: r=num/den return r which is correct.	Scott Vokes	Nov 27, 2010
Printed	Page 44 Inset commands after 3rd paragraph, about 3/5 down the page	[rownames[r] for r in k[0]] should read, [blognames[r] for r in kclust[0]] Likewise, [rownames[r] for r in k[1]] should read, [blognames[r] for r in kclust[1]]	Marisano James	Apr 26, 2010
PDF	Page 44 4th paragraph code	A try catch block can be put in the code, as when the program fails to load data for a user posts will become undefined in the ensuing portion of code. def fillItems(user_dict): all_items={} # Find links posted by all users for user in user_dict: print user; for i in range(3): try: posts=get_userposts(user) break except: print "Failed user "+user+", retrying" time.sleep(4) try: // put this try-catch block additionally for post in posts: url=post['href'] user_dict[user][url]=1.0 all_items[url]=1 except: print 'no posts'	Anonymous	May 16, 2013
Printed	Page 46 First line of second paragraph of text.	In this line: "Once this is done, the code first has to create a list of items than more than five people want, ..." change "five" to "ten". Alternatively, change 10 to 5 in the code listing following this paragraph.	Alan Esenther	Dec 20, 2012
Other Digital Version	49 chapter 2, def sim_distance(prefs,person1,person2)	I have a electronic copy (Kindle) so I don't have the page number. In chapter 2: >>> recommendations.sim_distance(recommendations.critics,'Lisa Rose','Gene Seymour') 0.148148148148 The code in the book is different from the code that is available in the Examples file I just downloaded. The value presented in the book is 0.148148148148 and is the result of the buggy version in the example code that lacks the final call to sqrt. The value in the book should be 0.294298055086 which is the result of using the code in the book. Please tell me which code is correct, the one in the book or the one in the Examples (http://examples.oreilly.com/9780596529321/PCI_Code.zip).	Gabriel Becedillas	Jan 18, 2010
Printed	Page 50 Very bottom of page	The variable "outersum" is defined but never used.	Anonymous	Mar 17, 2009
Printed	Page 58 3rd paragraph code	The code on page 58 (for searchengine.py) did not work for me (using python 2.7 and beautifulsoup 4). I had to change the code to: links=soup.find_all('a') for link in links: if (link.has_attr('href')): hope that helps!	Ahmad Al Ben Ali	May 18, 2013
Printed	Page 62 2nd paragraph (suggested), just after def getentryid	The first edition of the book does not include the working definition for the addlinkref function. That is, as printed, the function only contains the pass statement, and unfortunately is never updated. This causes the code in the Inbound Links section to fail (particularly the PageRank algorithm). I suggest that the completed code be placed on p. 62, just after the getentryid definition. [The full function definition does appear in the PCI_code.zip file available via the Examples link, however.]	Marisano James	Aug 05, 2009
Printed	Page 62 second block of code	words=self.separateWords(linkText) should read words=self.separatewords(linkText) with a lowercase w	Quentin Groom	Dec 07, 2010
PDF	Page 63 cur=self.con.execute(fullquery)	When using a query not in the database I receive the following error >>> e.query('gorgonzola') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "searchengine2.py", line 223, in query rows,wordids=self.getmatchrows(q) File "searchengine2.py", line 201, in getmatchrows cur=self.con.execute(fullquery) pysqlite2.dbapi2.OperationalError: near "where": syntax error I intend to find a solution.	Lola Montes	Jun 25, 2013
PDF	Page 67 First Paragraph in sub-topic Cluster of Preferences	The website http://www.zebo.com/ seems to be no longer operational. Please take note of this.	Anonymous	May 16, 2013
Printed	Page 68 1st code segment of the "Word Distance" section; 3rd from last line	The line dist=sum([abs(row[i]-row[i-1]) for i in range(2,len(row))]) contains an error that renders the code substantially less useful - often almost useless. Recall that Python indexing begins at zero. Thus for the line to work appropriately, it should read: dist=sum([abs(row[i]-row[i-1]) for i in range(1,len(row))]) Note that the lower bound of the range has been decremented by one. Summing across the word distances is correct, however (as explained in the text), unlike what was anonymously reported as an error on p. 69 below.	Marisano James	Aug 08, 2009
Printed	Page 68 1st code segment of the "Word Distance" section	Saw this a little while back. Ilya Verbitskiy writes that my errata and submitted fix are incorrect. I am no longer able to download the data to re-check my submitted error and fix (and I don't recall enough to make my own test case), but what I was saying is that although urlid is the first element, indexing begins at zero, so the element in the first (rather than the "zeroeth" position) is a word with some distance to other words the user queried. At the time I tried the code with and without the updated offset and surveyed the output of each as well.	Marisano James	May 22, 2016
Printed	Page 69 1st code segment (4th line down)	dist = sum([abs(rows[i] - row[i-1) for row in range(2, len(ow))]) This makes no sense. Presumably sum should be min. This is a lovely book, but I agree with an earlier submitter, very poorly proofread. This is especially the case as the code isn't commented. Even if the code were error free, it is quite terse and hard for a non Python expert to read. The errors really detract from readability.	Anonymous	Jul 28, 2009
Printed	Page 69 1st code segment of the "Word Distance" section	What Marisano James said is not right. Remember, the first element in the row tuple is the urlid. So it makes no sense to do the following: dist=sum([abs(row[i]-row[i-1]) for i in range(1,len(row))]). The code in the book is right, because the urlid of the row tuple must be ignored and the beginning is the second element in the row tuple (index 1).	Ilya Verbitskiy	Feb 17, 2012
Printed	Page 70 Figure 4-3	Page C has only three links to other pages. It should have four links to other pages.	Jarno Mielikainen	Dec 09, 2008
Printed	Page 70 Figure 4-3	The text states "C links to four other pages", but in the figure are only 3 additional arrows besides the one pointing to A. In the computation the total number of links on C is 5. Just a inconsistency.	Anonymous	Dec 19, 2008
Printed	Page 70 Last Paragraph	Looking at Figure 4-3, Page B and page C both have three links going to pages other than A. The text in the final paragraph states "B also has links to three other pages and C links to four other pages". Essentially, it looks like there should be an extra arrow coming out of C in the diagram to bring the total number of arrows to four.	Bryce Thomas	Apr 11, 2009
Printed	Page 70 Figure 4-3	Figure 4-3. Calculating the PageRank of A Page C links to A and four other pages the text says. This is confirmed by the calculation for PR(A) on page 71 where PR(C)/links(C) is defined as 0.7/5. The figure, however, only displays C's link to A and THREE other pages, which is correct for B, not C.	Maurice Jutte	Jan 31, 2011
Printed	Page 70 Figure 4-3	The outgoing links (arrows) from page C are wrong. In the text you read that "... and C links to four other pages.". Thus, the arrows going out in the drawing from page C should be 4 and not 3. This is also confirmed by the formula to compute PR(A) few lines below, where for page C you can read: PR(C)/links(C) and this is 0.7/5. Thanks, Christian	pezzin	Nov 08, 2012
Printed	Page 71 Beginning of code section, calculatepagerank	To be most effective, a call to self.dbcommit() should be inserted after the self.con.execute('drop table if exists pagerank') call (i.e. just after the first command of the calculatepagerank function). This permits calculatepagerank to be called multiple times in a single python session without the database returning an error stating that it already contains a pagerank table, and subsequently remaining locked due to the error.	Marisano James	Aug 05, 2009
Printed	Page 72 Second example down	the statement: for i in range(3): print cur.next() does nothing, why have a for loop if you are not going to use the index? Actually this can be fixed, and there are a lot of other typos in this book, to anyone looking for solutions here is what I did, I hope this saves you some time 1. look at the example code available to download, it is sometimes different from the code in the book 2. check here http://blog.kiwitobes.com/?p=44 to see if anyone else had the same problem as you	Peter Harrington	Feb 10, 2010
Printed	Page 79 #node outputs block and #hidden activations block	The list variable ai is a list of 1's. The value of ai never changes and it is only used to multiply against another list. That multiplication is redundant as it just results in the original list. Also, self.ai is created twice. Once in the middle of setupnetwork and at the start of feedforward. It is never changed and should result in the same value since setupnetwork is always run right before feedforward.	Anonymous	Jun 27, 2011
Printed	Page 80 5th Paragraph (Python console)	For the output of .getresult() (which comes from .feedforward()), it states: [0.76,0.76,0.76] However, I got: [0.076,0.076,0.076] Since these values are produced by TANH and the values are off by exactly one order-of-magnitude, I find it unlikely to be a decimal problem somewhere in the code, and more likely that it's a simple misprint.	Dustin Oprea	Aug 24, 2010
PDF	Page 81 last lines of last two paragraphs of the codes	The formulas to calculate output_deltas and hidden_deltas are wrong. According to the Delta Rule (http://en.wikipedia.org/wiki/Delta_rule), either these formulas should be error divided by dtanh, or the dtanh itself should be 1/(1-y*y).	Eric Wang	Mar 21, 2009
Printed	Page 81 1st paragraph	The derivative of tanh(y) is defined as 1-y*y. It is unclear how the author arrives at this result. Taking the Taylor expansion of tanh(y) around 0 results in y-y^3/3+(2 y^5)/15-(17 y^7)/315+(62 y^9)/2835-(1382 y^11)/155925+O(y^12). Throwing away the higher order terms and differentiating, we arrive at tanh'(y)=1-y^2, a result which holds around y=0.	Peter Tegelaar	May 05, 2011
Other Digital Version	87 2nd paragraph - code	When I type code below into my Notepad (Win32) from optimization.py, I get the following error? flights={} # for line in file('schedule.txt'): origin,dest,depart,arrive,price=line.strip().split(',') flights.setdefault((origin,dest),[]) Error message: 'ValueError: need more than 1 value to unpack'. I've tried everything and I can't seem to get around this error. I'm new so maybe I'm doing something wrong??	Anonymous	Jul 01, 2012
Printed,	Page 87 3rf Paragraph (ignoring code sample)	The paragraph references a link to retrieve the sample data: http://kiwitobes.com/optimize/schedule.txt But this is a dead link. It looks like all the data (for this and other chapters) has been moved to the sample code zip file but there's no reference to this on the support page for the book.	Russell	Oct 26, 2014
Printed	Page 88 Code definition for printschedule method	for d in range(0, len(r), 2): should be for d in range(0, len(people)): Iterating over the schedule is incorrect, the list of people should be iterated over. This is why the out and ret variables are retrieved via r[2d] and r[2d+1], respectively. As is, this causes an index out of bounds exception since the length of the schedule is twice the length of the number of people.	Jakob Homan	Jun 15, 2009
Printed	Page 88,90 Throughout schedulecost and printschedule method definitions	All instances of determining the return flight (returnf and ret variables) should have their destination and origin assignments switched, in order to find a flight back from LGA. As printed, the code finds the same flight for both origin and return.	Jakob Homan	Jun 15, 2009
Printed	Page 88 bottom of page	The index for the line beginning with out= should be [int(r[2d])] rather than [r[d]], and the index for the line beginning with ret= should be [int(r[2d+1])] rather than [r[d+1]]. There is no need to change the range of the for loop in the printschedule function, however, or to reverse any of the destination and origin assignments, as was erroneously reported elsewhere in this errata list.	Anonymous	Nov 10, 2009
PDF	Page 88-94	optimization.py def printschedule(r): for d in range(len(r)/2): name=people[d][0] origin=people[d][1] out=flights[(origin,destination)][int(r[2d])] # corr ret=flights[(destination,origin)][int(r[2d+1])] # corr print '%10s%10s %5s-%5s $%3s %5s-%5s $%3s' % (name,origin, out[0],out[1],out[2], ret[0],ret[1],ret[2]) def schedulecost(sol): totalprice=0 latestarrival=0 earliestdep=2460 for d in range(len(sol)/2): # Get the inbound and outbound flights origin=people[d][1] outbound=flights[(origin,destination)][sol[2d]] # corr returnf=flights[(destination,origin)][sol[2d+1]] # corr # Total price is the price of all outbound and return flights totalprice+=outbound[2] totalprice+=returnf[2] # Track the latest arrival and earliest departure if latestarrival<getminutes(outbound[1]): latestarrival=getminutes(outbound[1]) if earliestdep>getminutes(returnf[0]): earliestdep=getminutes(returnf[0]) # Every person must wait at the airport until the latest person arrives. # They also must arrive at the same time and wait for their flights. totalwait=0 for d in range(len(sol)/2): origin=people[d][1] outbound=flights[(origin,destination)][sol[2d]] # corr returnf=flights[(destination,origin)][sol[2d+1]] # corr totalwait+=latestarrival-getminutes(outbound[1]) totalwait+=getminutes(returnf[0])-earliestdep # Does this solution require an extra day of car rental? That'll be $50! if latestarrival<earliestdep: totalprice+=50 # corr return totalprice+totalwait def randomoptimize(domain,costf): best=999999999 bestr=None for i in range(0,1000): # Create a random solution r=[random.randint(domain[i][0],domain[i][1]) # corr for i in range(len(domain))] # Get the cost cost=costf(r) # Compare it to the best one so far if cost<best: best=cost bestr=r return bestr # corr def hillclimb(domain,costf): # Create a random solution sol=[random.randint(domain[i][0],domain[i][1]) for i in range(len(domain))] # Main loop while 1: # Create list of neighboring solutions neighbors=[] for j in range(len(domain)): # One away in each direction if sol[j]>domain[j][0] and sol[j]<domain[j][1]: # corr neighbors.append(sol[0:j]+[sol[j]+1]+sol[j+1:]) neighbors.append(sol[0:j]+[sol[j]-1]+sol[j+1:]) if sol[j]==domain[j][0]: # corr neighbors.append(sol[0:j]+[sol[j]+1]+sol[j+1:]) if sol[j]==domain[j][1]: # corr neighbors.append(sol[0:j]+[sol[j]-1]+sol[j+1:]) # See what the best solution amongst the neighbors is current=costf(sol) best=current for j in range(len(neighbors)): cost=costf(neighbors[j]) if cost<best: best=cost sol=neighbors[j] # If there's no improvement, then we've reached the top if best==current: break return sol def annealingoptimize(domain,costf,T=10000.0,cool=0.95,step=1): # Initialize the values randomly vec=[random.randint(domain[i][0],domain[i][1]) # corr for i in range(len(domain))] ----------------------------------------------------------- I haven't looked at the geneticoptimize() function, but it also seems problematic. Executing it I get 4616 4282 3893 3610 Traceback (most recent call last): File "glass_gen", line 3, in <module> s=optimization.geneticoptimize(domain,optimization.schedulecost) File "/root/Collective_Intelligence/optimization.py", line 184, in geneticoptimize scores=[(costf(v),v) for v in pop] File "/root/Collective_Intelligence/optimization.py", line 43, in schedulecost for d in range(len(sol)/2): TypeError: object of type 'NoneType' has no len() ----------------------------------------------------- p. 91 quotation: In theory, you could try every possible combination, but in this example there are 16 flights, all with 9 possibilities, giving a total of 916 (around 300 billion) combinations. Testing every combination would guarantee you'd get the best answer. However, the total cost of the outbount flights is independent of the cost of the return flights, so the minimization of the outbound and return costs can be done separately. Six outbound and return flights yield 26**9 (20155392) combinations, which makes testing every combination an easy task. Thus,the optimal solution is [4, 3, 3, 3, 4, 3, 3, 4, 4, 3, 4, 3] cost: 2306, [1102, 81, 942, 181, 0] Seymour BOS 12:34-15:02 $109 10:33-12:03 $ 74 Franny DAL 10:30-14:57 $290 10:51-14:16 $256 Zooey CAK 12:08-14:59 $149 10:32-13:16 $139 Walt MIA 11:28-14:40 $248 12:37-15:05 $170 Buddy ORD 12:44-14:17 $134 10:33-13:11 $132 Les OMA 12:18-14:56 $172 11:07-13:24 $171	Olov Einarsson	Sep 12, 2010
Printed	Page 88 function def printschedule(r)	The function defprintschedule(r) uses d to key into both the people and r arrays. This results in an error in logic when executing the function. # Algorithm from book did not work, added d = 0 and personIndex variable in for loop. Made personIndex as key index into the people array and used d as out and ret key index. This is necessary because each consecutive location in the r array contains a pair of origin, destination locations, so for each person, array index i and index i+1 refer to the flight number for a round trip. Solution: d = 0 for personIndex in range(len(r)/2): name = people[personIndex][0] origin = people[personIndex][1] out = flights[(origin, destination)][r[d]] ret = flights[(destination, origin)][r[d+1]] d = d + 2 print '%10s%10s %5s-%5s $%3s %5s-%5s $%3s' % (name, origin, out[0], out[1], out[2], ret[0], ret[1], ret[2])	Anonymous	Nov 15, 2011
Printed	Page 88 4th paragraph	In this case, each number can represent which flight a person chooses to take, where 0 is the first flight of the day, 1 is the second, and so on. Since each person needs an outbound flight and a return flight, the length of this list is twice the number of people. For example, the list: [1,4,3,2,7,3,6,3,2,4,5,3] Represents a solution in which Seymour takes the second flight of the day from Boston to New York, and the fifth flight back to Boston on the day he returns. Franny takes the fourth flight from Dallas to New York, and the third flight back. Should be : Represents a solution in which Seymour takes the second flight of the day from Boston to New York, and the (seventh) flight back to Boston on the day he returns. Franny takes the (fifth) flight from Dallas to New York, and the (fourth) flight back.	Arie Schlesinger	Mar 26, 2012
Printed	Page 89 code example output	there has been previous report on some other issues on the example, e.g airport name is nowhere in schedule.txt and not programmed, also, the discussion and code are a bit inconsistent. Now, when running example code, the output is very different from the printout in the book as well. In summary, the illustration here is very confusing.	Anonymous	Jun 10, 2008
PDF	Page 90 the last 3 and 4 lines	int(sol[d]) should be int(sol[2d]) in(sol[d+1]) should be int(sol[2d+1]) and therefore the calculate result at the next page should be 5469	Eric Wang	Mar 23, 2009
Printed	Page 90 3rd paragraph, first sentence	The text states that, "There are a huge number of possibilities for the getcost function defined here." There is no getcost function, however; "getcost" should be replaced with "schedulecost".	Marisano James	Nov 06, 2009
Printed	Page 91 3rd paragraph	In theory, you could try every possible combination, but in this example there are 10 flights, all with 6 possibilities, giving a total of 6^10 combinations. instead of In theory, you could try every possible combination, but in this example there are 16 flights, all with 9 possibilities, giving a total of 9^16 combinations. b'cause 6 family members having 10 flights to LGA	Anonymous	Jun 21, 2008
Printed	Page 91 3rd paragraph	I believe that calculation should be: (10 flights to LGA * 10 flights from LGA) ^ (# of family members) = (10^2)^6 = 10^12 or in Python, (102)6 = 10**12	Anonymous	Nov 06, 2009
Printed	Page 91 3rd paragraph	Oh, and 9**16 is much closer to 300 trillion than it is to 300 billion	Anonymous	Nov 06, 2009
Printed	Page 92 1st code snippet	> return r should be > return bestr	Anonymous	Oct 23, 2008
Printed	Page 92 first code snippet	The randomoptimize function should return bestr, not r.	Andy Young	Jan 20, 2009
Printed	Page 92 2nd line of Python session section (toward top of page)	The domain goes from 0-10, i.e. 0..9 rather than 0..7. (See schedule.txt for confirmation; also on p. 97, 8 could not be included as part of the solution if the highest available value were 7.) This means the line specifying the domain should read: domain=[(0,10)](len(optimization.people)2)	Marisano James	Nov 07, 2009
Printed	Page 93 hillclimb function code sample	The spelling of neighbors is inconsistent in the code sample. All instances of the word neighbors/neighbours in the code sample should be changed either to "neighbors" or "neighbours", but not a mix of both.	Bryce Thomas	May 06, 2009
Printed	Page 93 bottom third of page, about halfway through the code section	The code as printed in the first edition of the book, allows for negative index references (in Python these will not cause index out of range errors, but will instead read from the opposite end of the list) and does not explore the full domain. One way to address these shortcomings is to change the domain to take on actual values from 0 up to 9 inclusive, and to make the following changes: if sol[j]>domain[j][0]: should be changed to read, if sol[j] > domain[j][0] and sol[j] < domain[j][1] - 1: Likewise, if sol[j]<domain[j][1]: should read, if sol[j] < domain[j][1] and sol[j] > domain[j][0]: Another solution is to adjust the range as stated, but then to enforce a wrap-around at the ends of the range (this approach allows the algorithm to test all of the neighboring schedules that comprise the local minimum), for j in range(len(domain)): # One way in each direction. # # Test modified index again the upper bound. if sol[j] > domain[j][0]: if sol[j]+1 < domain[j][1] - 1: value = sol[j]+1 else: value = domain[j][0] neighbors.append(sol[0:j] + [value] + sol[j+1:]) # The lower upper bound. if sol[j] < domain[j][1]: # Use the mod function to wrap around. neighbors.append(sol[0:j] + [(sol[j]-1)%domain[j][1]] + sol[j+1:])	Marisano James	Nov 07, 2009
Printed	Page 95	The probability of a higher-cost solution being accepted should read: p = e^(-(highcost-lowcost)/temperature)	Sercan Taha Ahi	Dec 13, 2009
Printed	Page 96 Chapter 5	calculation of probability is wrong. Is: p=pow(math.e,-eb-ea/T) must be: p=pow(math.e,-(eb-ea)/T) reason: if eb<ea, eb is accepted anyway if eb>ea, then eb-ea is how much the solution got worse. If this is almost 0, then why not accepting (p becomes almost 1)? if this is very big, it should be less likely to accept the change, at least at low temperatures! -eb-ea also "somehow" works, because the temperature decreases enough. However, since p is quite small, there is not much annealing and we have to start from very high temperatures to have some p's close to 1.	Anonymous	Dec 29, 2009
Printed	Page 98 mutate function definition	The mutate function is missing an else clause (or some other mechanism for returning a default value). A possible solution would be to remove the elif conditional and guarantee that the function always returns a value: # Mutate operation def mutate(vec): i=random.randint(0,len(domain)-1) if random.random()<0.5 and vec[i]>domain[i][0]: return vec[0:i]+[vec[i]-step]+vec[i+1:] else: return vec[0:i]+[vec[i]+step]+vec[i+1:]	Jeremy Mason	Dec 08, 2008
Printed	Page 98 code example in function def mutate	the mutate function returns None if neither the if or elif are true. This results in None being appended to the pop list which causes badness in the cost function. suggest adding the following to the end of the mutate function else: return vec	Matt Mercer	Mar 01, 2009
PDF	Page 98 the mutate function of code example	the mutate function should has a else clause which return vec. Otherwise None will be added into pop occasionally	Eric Wang	Mar 24, 2009
PDF	Page 98 In the geneticoptimize function	Firstly in this function correct the mistyped 'mutprod' to 'mutprob' The mutate function goes out of bounds for different step values plus the previously reported error so here is my corrected version def mutate(vec): i=random.randint(0,len(domain)-1) # Corrected so different step values can be added if random.random()<0.5 and vec[i]-step>domain[i][0]: return vec[0:i]+[vec[i]-step]+vec[i+1:] elif vec[i]+step<domain[i][1]: return vec[0:i]+[vec[i]+step]+vec[i+1:] else: return vec Also during the mainloop mutations/crossovers are made but not scored, so my solution is to make maxiter +1 and in the last iteration just score it without adding mutations; # Main loop for i in range(maxiter): scores=[(costf(v),v) for v in pop] scores.sort() if i == maxiter-1: pass # Corrected from here for last iteration else: ranked=[v for (s,v) in scores] # Start with the pure winers pop=ranked[0:topelite] # Add mutated and bred forms of the winners while len(pop)<popsize: if random.random()<mutprob: # Mutation c=random.randint(0,topelite) pop.append(mutate(ranked[c])) else: # Crossover c1=random.randint(0,topelite) c2=random.randint(0,topelite) pop.append(crossover(ranked[c1],ranked[c2])) # Print current best score print scores[0][0] return scores[0][1]	Lola Montes	Jul 02, 2013
Printed	Page 101 peter.tegelaar@gmail.com	When you follow the Kayak API link (http://www.kayk.com/labs/api/search), you get the following error message: 'Kayak Search API is no longer supported. Because of costly misuse of the search API, KAYAK no longer can support it. Sorry.'	Peter Tegelaar	Apr 11, 2011
Printed	Page 108 second code example	for i in range(len(dorms): slots += [i,i] should be: for i in range(len(dorms)): slots += [i,i]	Anonymous	Jun 12, 2008
Printed	Page 109 bottom	The output of dorm.printsolution(s) uses the value of s as generated by s=optimize.randomoptimize(), for a solution cost of 18 (the original generated solution.) optimize.geneticoptimize() is then called, but the results are not reassigned to s, and the output is confusing. It might be more clear if this session were changed to: reload(dorm) s=optimize.randomoptimize(dorm.domain,dorm.dormcost) dorm.printsolution(s) dorm.dormcost(s) r=optimize.geneticoptimize(dorm.domain,dorm.dormcost) dorm.printsolution(r) so that the before and after can be seen.	Anonymous	Oct 24, 2009
Printed	Page 112 code, bottom	when calculating ua and ub, the book contains this code ua = ((x4 - x3) * (y1 - y3) - (y4 - y3) * (x1 - x3)) / den ub = ((x2 - x1) * (y1 - y3) - (y2 - y1) * (x1 - x3)) / den this works well for python 3.x but not for python 2.x if you are using python 2.x you should instead write: ua = ((x4 - x3) * (y1 - y3) - (y4 - y3) * (x1 - x3)) / float(den) ub = ((x2 - x1) * (y1 - y3) - (y2 - y1) * (x1 - x3)) / float(den) (by default python 2.x uses integer division when using the '/' operator together with two integers)	Ilya Verbitskiy	Apr 02, 2012
PDF	Page 113 drawnetwork()	The PIL wasn't showing me any results so I print it out as a a jpg instead from PIL import Image, ImageDraw def drawnetwork(sol): jpeg='network_diagram.jpg' # Create the image img=Image.new('RGB',(400,400),(255,255,255)) draw=ImageDraw.Draw(img) # Create the position dict pos=dict([(people[i],(sol[i2],sol[i2+1])) for i in range(0,len(people))]) # Draw links for (a,b) in links: draw.line((pos[a],pos[b]),fill=(255,0,0)) # draw people for n,p in pos.items(): draw.text(p,n,(0,0,0)) img.save(jpeg,'JPEG')	Lola Montes	Jul 05, 2013
Printed	Page 121 2nd sample code block	After importing docclass, the db should be initialized. cl.setdb('test.db') Or this line can be added to docclass.py	Anonymous	May 21, 2008
PDF	Page 121 Calculating Probabilities	"In this case, you can calculate the probability that a word is in a particular category by dividing the number of times the word appears in a document in that category by the total number of documents in that category." in this sentence, it indicated that the Probability of a word = total number times the word appears / total count of document But I think it is wrong , right one must as follow two case: A. Probability of a word = total number times the word appears/ total count of the word in all document B. Probability of a word = total number times the word appears / total count of the word in the document.	Ricky	Jan 05, 2013
Printed	Page 123 1st paragraph	The function for the calculation of the weighted probabilities (weightedprob) is wrong. In more details, the variable "total" is calculated in a wrong way. "total" is a weighting factor for "basicprob" (probability to find a document with a given feature in a given category). In the book "total" is calculated as "the number of times this feature has appeared in all categories". However, the weighting factor "total" should be equal to the number of items in a considered category.	Anonymous	Oct 09, 2009
PDF	Page 126 4th Paragraph	The line states that a new instance variable should be added to classifier when in fact it should be added to naivebayes, as should the other methods on the page.	Lola Montes	Jul 09, 2013
Printed	Page 129 very last line of code	I believe the last argument to the invchi2 function should not be multiplied by 2. That is, it should be: return self.invchi2(fscore, len(features)) rather than: return self.invchi2(fscore, len(features)*2)	Roy Pardee	Mar 12, 2009
PDF	Page 129 United States	Fisher's method is described as: "The Fisher method involves multiplying all the probabilities together, then taking the natural log (math.log in Python), and then multiplying the result by ?2. Add this method to fisherclassifier to do this calculation:" And indeed, this is how it is implemented. However, as far as I can tell, Fisher's method, according to other sources, is the summation of the natural logarithm of each observation, not the natural logarithm of the product sum of each observation. In practice, it works out because log(xy) = log(x) + log(y) and summation is distributive, but this explanation caused me some serious confusion at first when I tried to find more information about this technique.	Anonymous	Mar 26, 2013
	132-133 Do not know	The SQL docclass.py classification SQLite example should be coded using the SQL ? substitution model rather than using Python's method. The functions incf() and fcount() generate program exceptions when the feature input contains a quotation mark (and perhaps other characters as well). For example, the SQLite statement: self.con.execute("insert into fc values ('%s','%s',1)" % (f,cat)) should instead be coded: self.con.execute("insert into fc values (?,?,1)", (f,cat)) The latter (improved) format allows SQL to handle input that would otherwise cause problems.	Thornton Martin	Dec 06, 2009
Printed	Page 137 code sample	In the entryfeatures function, summarywords is produced by calling the lower() method on each string produced by splitting the summary text. The function later tries to count uppercase words by calling the isupper() method on each string in summarywords. Since these strings have already been converted to lower case, isupper() will never return true.	Anonymous	Dec 29, 2009
Printed	Page 137 At the code example	I found a minor issue with the code example at page 137. The code tries to detect overuse o uppercase use but it fails because it converts all the words to lowercase in the second list comprehension of the function. summarywords = [s.lower() for s in splitter.split(entry['summary']) if len(s) > 2 and len(s) < 20] The code to detect the uppercase abuse correctly is (see that I called lower() in f[w.lower()] and in f[towowords.lower()]: def entryfeatures(entry): splitter = re.compile('\\W*') f = {} # Extract the title words and annotate titlewords = [s.lower() for s in splitter.split(entry['title']) if len(s) > 2 and len(s) < 20] # Extract the summary words summarywords = [s for s in splitter.split(entry['summary']) if len(s) > 2 and len(s) < 20] # Count uppercase words uc = 0 for i in range(len(summarywords)): w = summarywords[i] if (w.isupper()): uc += 1 f[w.lower()] = 1 # Get word pairs in summary as features if i < len(summarywords) - 1: twowords = ' ' . join(summarywords[i:i+1]) f[twowords.lower()] = 1 # Keep creator and publisher whole f['Publisher:'+ entry['publisher']] = 1 # UPPERCASE is a virtual word flagging too much shouting if (float(uc) / len(summarywords) > 0.3): f['UPPERCASE'] = 1 return f	Eriksen Costa	Mar 21, 2010
PDF	Page 138 Top of the page	Just a mild typo but had me confused for a couple of minutes :) What it says: If you want to use this new version with filterfeed, you'll have to change the function to pass the entries to the classifier rather than to fulltext. Just change the end to: ....... What it means: Change the ending of feedfilter.read();	Lola Montes	Jul 11, 2013
PDF	Page 144 2nd Paragraph	Importing decision_tree_example.txt keeps "\n"s in the data, so remove them or this taints the data example from uniquecounts() {'Basic\n': 4, 'Premium\n': 1, 'Basic': 1} This is how I felt like doing it today my_data=[line.split('\t') for line in file('decision_tree_example.txt')] for a in range(len(my_data)): for b in range(len(my_data[a])): my_data[a][b] = my_data[a][b].replace('\n','')	Lola Montes	Jul 22, 2013
PDF	Page 146 Top of Page	divideset() was splitting the numerical-data as string-data because that was is it was stored in my_data. So, instead of splitting >=value it was splitting ==value The corrected version produces the same tree, but with different datasets who knows? def divideset(rows,column,value): try: value = int(value) # Convert to int here except: pass split_function=None if isinstance(value,int) or isinstance(value,float): split_function=lambda row: int(row[column])>=value # Convert here too else: split_function=lambda row:row[column]==value etc......................	Lola Montes	Aug 01, 2013
PDF	Page 148 In the second indented block	In the line Entropy = sum of p(i) x log(p(i)) for all outcomes By what the author mean, Entropy should be the minus of that. It's probably a typo, since in the implementation below, the computation didn't forget the minus sign.	jvc	Jul 25, 2014
Printed	Page 150 buildtree function	The recursive calls at the end of the buildtree function should propagate the scoref parameter. Otherwise if you use a scoref function besides the default "entropy" function it will only be used on the first call. So instead of trueBranch=buildtree(best_sets[0]) it should be trueBranch=buildtree(best_sets[0], scoref)	Stan Dyck	May 05, 2009
PDF	Page 155 last line	The line delta=entropy(tb+fb)-(entropy(tb)+entropy(fb)/2) should be delta=entropy(tb+fb)-(entropy(tb)+entropy(fb))/2	Xiaochen Wang	Jun 04, 2012
Printed	Page 157 bottom of page, second last line of code	mdclassify function is called qualified by treepredict2, but treepredict2 is neither imported (doesn't exist anyway) or defined anywhere. Changing treepredict2.mdclassify(['google', 'France',None,None],tree) to simply treepredict.mdclassify(['google', 'France',None,None],tree) appears to generate the intended results.	Bryce Thomas	May 09, 2009
Printed	Page 157 code sample at bottom of page	The reported results for the first test of the mdclassify function are different than those produced by the code in the downloadable code samples. Here's what I get with Python 2.6.4: >>> tree = treepredict.buildtree(treepredict.my_data) >>> treepredict.mdclassify(['google', None, 'yes', None], tree) {'Premium': 2.25, 'Basic': 0.25}	Anonymous	Jan 05, 2010
Printed	Page 160 getaddressdata function code sample top half of page	It appears as though the Zillow API does not return "totalRooms" anymore (assuming it once did). Furthermore, some of the houses that it searches for appear to return no actual values from Zillow, or still cause the exception block to be hit. Later on, when the code asks for len(row) in the variance method of treepredict.py, this will cause "TypeError: object of type 'NoneType' has no len()". As a work around: rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data should be commented out. Also, in getpricelist() function, change l1.append(data) to if data != None l1.append(data) this should prevent null rows getting added and therefore prevent the TypeError exception described above.	Bryce Thomas	May 09, 2009
Printed	Page 161 Modelling "Hotness"	Hot or Not API is no longer available, so Hot or Not stuff will not work.	Bryce Thomas	May 09, 2009
Printed	Page 175 top half of page, gaussian function and interactive interpreter sample	The gaussian function shown at the top of the page does not produce the results shown in the interactive interpreter sample code. E.g. numpredict.gaussian(1.0) produces 0.99501247... not 0.606530659... as shown. To get results which match that of the sample interactive interpreter, change the gaussian function to: def gaussian(dist,sigma=10.0): return math.e*(-(dist10)*2/(2sigma**2)) This way, results match what's shown on the page and are then also consistent with what's illustrated in figure 8-5.	Bryce Thomas	May 10, 2009
PDF	Page 182 5th Paragraph	The pdf says to call geneticoptimize using; optimization.geneticoptimize(numpredict.weightdomain,costf,popsize=5,lrate=1,maxv=4,iters=20) I assume this was an different version of geneticoptimize(), as the values accepted by geneticoptimize are geneticoptimize(domain,costf,popsize=50,step=1,mutprob=0.2,elite=0.2,maxiter=100) I tried this instead, guessing that 'maxv' related to 'elite'. optimization.geneticoptimize(numpredict.weightdomain,costf,popsize=5,step=1,elite=0.2,maxiter=20)	Lola Montes	Aug 05, 2013
Printed	Page 186 code example	numpredict.cumulativegraph(data, (1,1), 6) the "high" price 6 doesn't make sense. according to the figure 8-10, the high should be at least 120.	Anonymous	Jun 19, 2008
Printed	Page 186 interactive interpreter code sample	the cumulativegraph function is being passed the vector (1,1), which means it will draw the cumulative probability for the price of bottles of wine which have a rating of 1 and are 1 year old. These wine bottles would be so terrible that the cumulative probability would reach 1 at a price of 0, meaning there's nothing to see on the graph generated. A better example would be to use say a wine bottle rating of 99 that's 10 years old, with a call like: numpredict.cumulativegraph(data,(99,10),120) This way, they'll actually be something useful displayed on the graph when its printed and it should look more like Figure 8-10 does.	Bryce Thomas	May 10, 2009
Printed	Page 188 interactive interpreter code sample	Like the error on page 186, using the vector (1,1) - a wine bottle with a rating of 1 and 1 year old, means that the wine would be so bad that it would all cost essentially 0. Instead, use a vector like say (99,10). Also, the value of 6 used for high is quite small. To get something that looks similar to figure 8-11, the value 120 should be used instead. So, the line which reads: numpredict.probabilitygraph(data,(1,1),6) should become something like numpredict.probabilitygraph(data,(99,10),120)	Bryce Thomas	May 10, 2009
Printed	Page 199 code sample	The paragraph after the code sample states "The points will be O if the people are a match and X if they are not." The code sample however draws a scatterplot where people that are a match are represented with a green O and people that are not a match are represented with a red O (not an X). To get a scatterplot that uses red X's instead of red O's, change the line: plot(xdn,ydn,'ro') to plot(xdn,ydn,'rx') Alternatively, to get a scatterplot that looks like that in figure 9-1 (which uses neither O's nor X's), change the same line: plot(xdn,ydn,'ro') to plot(xdn,ydn,'r+')	Bryce Thomas	Oct 13, 2009
Printed	Page 199 middle of page (code section)	If the plotagematches function is really to be called from one's Python session (rather than being added to the advancedclassify.py file) then one does not need to reload advancedclassify, and should only type "plotagematches(agesonly)" instead of "advancedclassify.plotagematches(agesonly)" into the Python session.	Marisano James	Nov 01, 2009
Printed	Page 204 First Paragraph	There is a sentence which states "There are two other points, X0 and X1, which are examples that have to be classified". The diagram on the other hand uses points X1 and X2, but no X0. I believe that the sentence should be changed to "There are two other points, X1 and X2, which are examples that have to be classified."	Bryce Thomas	Oct 15, 2009
Printed	Page 208 top of page, 2nd line of getlocation function body	The URL for Yahoo! Maps Web Services has changed. The line reading, 'http:\\api.local.yahoo.com/MapServices/V1/'+\ should now read, 'http://local.yahooapis.com/MapsService/V1/'+\	Marisano James	Nov 02, 2009
Printed	Page 209 halfway down the page	Apparently there are two 824 3rd Avenues in New York city: one that's approximately 0.9 miles from 220 W 42nd St, and another that's about 6.6 miles away. It might be good to include zip codes with some, or all, of the addresses from the matchmaker.csv file to avoid confusion. [The new Yahoo! Maps Services returns the latitude and longitude for the second address - the one that's circa 6.6 miles away - as the default rather than those of the address that's approx. 0.9 miles away, as listed in the book.]	Marisano James	Nov 02, 2009
Printed	Page 210 In the scaledata function	In the book, scaleinput should be defined as: # Create a function that scales data def scaleinput(d): return [(d[i] - low[i]) / (high[i] - low[i]) for i in range(len(low))] In the printed copy, the return statement tries to access d.data[i], even though the data member was passed in to scaleinput, in both scaledata and in the interpreter example below the code.	Anonymous	Aug 15, 2008
Printed	Page 213 top	The printed first edition does not include the source code for the veclength function, def veclength(v): return sum([p**2 for p in v]) Perhaps it should be placed toward the top of page 213. [The function does appear in the file PCI_code.zip, available via the Examples link <http://examples.oreilly.com/9780596529321/>.]	Marisano James	Nov 03, 2009
Printed	Page 213 top	Just above the definition of the rbf function there should be an import math statement to ensure that the constant math.e is defined.	Marisano James	Nov 03, 2009
Printed	Page 213 top of page, in rbf definition	To be in accordance with the other gammas, and with the code listed in PCI_code.zip, the first line of the rbf function definition should read, def rbf(v1, v2, gamma=10): For the purposes of the book, I believe that the gamma parameter is always explicitly passed, however.	Marisano James	Nov 03, 2009
Printed	Page 213 bottom third of page (2nd from last line of the nlclassify function)	The line, if y<0: return 0 should read, if y>0: return 0 This is in accordance with the code included in PCI_code.zip and also allows the offset used on page 214 to be meaningfully calculated. (See below.) Without this change, each result returned by nlclassify will be the opposite of its intended value.	Marisano James	Nov 03, 2009
Printed	Page 214 Interactive interpreter sample at top of page	Before you can execute any of the statements, you first need to reload advancedclassify using: reload(advancedclassify) The book also never defines the variable "offset" which is being parsed into the nlclassify function. I'm not sure what a reasonable value for this would be.	Bryce Thomas	Oct 15, 2009
Printed	Page 214 top of page, just before the first call to nlclassify	Just after the missing reload(advancedclassify) statement (See Bryce's comment for more), one should execute offset=advancedclassify.getoffset(agesonly) To properly define the otherwise undefined offset. This should produce a value of some -0.0076450020098023288.	Marisano James	Nov 03, 2009
PDF	Page 217 Beginning 3rd paragraph	The current version of the Python Wrapper for LIBSVM,(3.17), doesn't support the commands in the PDF. Two solutions: 1: Download version 2.91 from http://www.csie.ntu.edu.tw/~cjlin/libsvm/oldfiles/ and make from the python_old directory. 2: Using the following commands found on Stack Overflow Note I couldn't get cross-validation to work though :? from svm import * from svmutil import * svm_model.predict = lambda self, x: svm_predict([0], [x], self)[0][0] prob = svm_problem([1,-1], [[1,0,1], [-1,0,-1]]) param = svm_parameter() param.kernel_type = LINEAR param.C = 10 m=svm_train(prob, param) m.predict([1,1,1]) svm_save_model('test.model',m) m = svm_load_model('test.model') answers,inputs=[r.match for r in scaledset],[r.data for r in scaledset] param = svm_parameter() param.kernel_type = RBF prob = svm_problem(answers,inputs) m=svm_train(prob, param) newrow=[28.0,-1,-1,26.0,-1,1,2,0.8] # Man doesn't want children, woman does m.predict(scalef(newrow)) newrow=[28.0,-1,1,26.0,-1,1,2,0.8] # Both want children m.predict(scalef(newrow))	Lola Montes	Aug 23, 2013
Printed	Page 218 top	The lines reading, m.save(test.model) m=svm_model(test.model) should read, m.save('test.model') m=svm_model('test.model') Also, in order for the svmc.pyd file to work with your Python environment, it must be of the correct version. The svmc.pyd file included in PCI_code.zip is for Python 2.4, the one available from the LIBSVM website (as of November 3, 2009) is for Python 2.6. I used this website to get a version that works with Python 2.5: <http://www.cs.sunysb.edu/~algorith/implement/libsvm/distrib/>. [It's also possible to build the svmc.pyd with C.]	Marisano James	Nov 03, 2009
Printed	Page 219 Facebook section	The code for facebook has change dramatically since the publishing and no longer functions. Use of the book code returns a "This API version is deprecated" <?xml version="1.0" encoding="UTF-8"?> <error_response xmlns="http://api.facebook.com/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://api.facebook.com/1.0/ http://api.facebook.com/1.0/facebook.xsd"> <error_code>12</error_code> <error_msg>This API version is deprecated</error_msg> <request_args list="true"> <arg>	Jeffery Shipman	Apr 08, 2009
Printed	Page 230 makematrix() function	Non-Negative Matrix Factorization is well defined only when there all no all-zero rows and columns. The makeMatrix() function, by excluding "words that are common but not too common," can eliminate all the word in some documents yielding all-zero rows. The evidence that this has occurred is error messages and NaNs in the output. It can be fixed two ways. A quick-and-dirty fix is to eliminate fewer words in makematrix(); this can be accomplished with simple changes to the if statement. A more robust fix is removing the resulting all-zero rows.	Scott Ainsworth	May 03, 2009
PDF	Page 238 factorize(v, pc=10, iter = 50) function	As of D. Shrestha, "Document Clustering Through Non-Negative Matrix Factorization: A Case Study of Hadoop for Computational Time Reduction of Large Scale Documents.," pp. 1-10. The parameter epsilon << 1 is added to the multiplicative update rule to avoid division by zero, thus avoiding a NaN in the algorithm. This is added to: wd=(whtranspose(h))+0.000000001 and hd=(transpose(w)wh)+0.000000001	Pavel Soriano	Mar 15, 2010
Printed	Page 238 Factorize function	Hi there, When I run the code of chapter 10 that comes with the book Programming Collective Intelligence I get the following error: news.py:121: RuntimeWarning: invalid value encountered in divide w=matrix(array(w)*array(wn)/array(wd)) news.py:86: RuntimeWarning: invalid value encountered in double_scalars dif += pow ( a [ i, j ] - b [ i, j ], 2 ) nan I have been looking for possible explanations today, but I can't find any. Could you please help? Kind regards, Thomas van Pelt the Netherlands	Anonymous	Dec 27, 2012
PDF	Page 238 Last paragraph	retuI found this version of factorize() that deals with the appearance of NAN's on GitHub Enjoy :) def factorize(v, pc=10, iter=50): ic, fc = shape(v) w = matrix([[random.random() for j in range(pc)] for i in range(ic)]) h = matrix([[random.random() for i in range(fc)] for i in range(pc)]) for i in range(iter): wh = w * h cost = difcost(v, wh) if i % 10 == 0: print("%d: %f" % (i, cost)) if cost == 0 or math.isnan(cost): print("%d: %f" % (i, cost)) break hn = transpose(w) * v hd = transpose(w) * w * h h = matrix(array(h) * array(hn) / array(hd)) wn = v * transpose(h) wd = w * h * transpose(h) # RuntimeWarning: invalid value encountered in divide wd = [[1e-20 if (x - 1e-20 < 0) else x for x in lst] for lst in wd.tolist()] w = matrix(array(w) * array(wn) / array(wd)) return w, hrn w, h	Lola Montes	Aug 30, 2013
PDF	Page 238 Last paragraph	""" Try this version of NMF that is about 15/20 times faster and in my opinion more accurate in the results it gives - lolamontes69 # NMF by alternative non-negative least squares using projected gradients # Author: Chih-Jen Lin, National Taiwan University # Python/numpy translation: Anthony Di Franco # Adapted for the Chapter 10 Programming Collective Intelligence by lolamontes69 see end for usage """ import random as random from numpy import from numpy.linalg import norm from time import time from sys import stdout def factorize(v1, pc, maxiter): ic, fc = shape(v1) temp1=v1.tolist() v=array(temp1) w = array([[random.random() for j in range(pc)] for i in range(ic)]) h = array([[random.random() for i in range(fc)] for i in range(pc)]) wo,ho = nmf(v, w, h, 0.00000001, 900, maxiter) temp2=wo.tolist() wo1=matrix(temp2) temp3=ho.tolist() ho1=matrix(temp3) return wo1, ho1 def nmf(V,Winit,Hinit,tol,timelimit,maxiter): """ (W,H) = nmf(V,Winit,Hinit,tol,timelimit,maxiter) W,H: output solution Winit,Hinit: initial solution tol: tolerance for a relative stopping condition timelimit, maxiter: limit of time and iterations """ W = Winit; H = Hinit; initt = time(); gradW = dot(W, dot(H, H.T)) - dot(V, H.T) gradH = dot(dot(W.T, W), H) - dot(W.T, V) initgrad = norm(r_[gradW, gradH.T]) print 'Init gradient norm %f' % initgrad tolW = max(0.001,tol)initgrad tolH = tolW for iter in xrange(1,maxiter): # stopping condition projnorm = norm(r_[gradW[logical_or(gradW<0, W>0)], gradH[logical_or(gradH<0, H>0)]]) if projnorm < tolinitgrad or time() - initt > timelimit: break (W, gradW, iterW) = nlssubprob(V.T,H.T,W.T,tolW,1000) W = W.T gradW = gradW.T if iterW==1: tolW = 0.1 * tolW (H,gradH,iterH) = nlssubprob(V,W,H,tolH,1000) if iterH==1: tolH = 0.1 * tolH if iter % 10 == 0: stdout.write('.') print '\nIter = %d Final proj-grad norm %f' % (iter, projnorm) return (W,H) def nlssubprob(V,W,Hinit,tol,maxiter): """ H, grad: output solution and gradient iter: #iterations used V, W: constant matrices Hinit: initial solution tol: stopping tolerance maxiter: limit of iterations """ H = Hinit WtV = dot(W.T, V) WtW = dot(W.T, W) alpha = 1; beta = 0.1; for iter in xrange(1, maxiter): grad = dot(WtW, H) - WtV projgrad = norm(grad[logical_or(grad < 0, H >0)]) if projgrad < tol: break # search step size for inner_iter in xrange(1,20): Hn = H - alphagrad Hn = where(Hn > 0, Hn, 0) d = Hn-H gradd = sum(grad d) dQd = sum(dot(WtW,d) * d) suff_decr = 0.99gradd + 0.5dQd < 0; if inner_iter == 1: decr_alpha = not suff_decr; Hp = H; if decr_alpha: if suff_decr: H = Hn; break; else: alpha = alpha * beta; else: if not suff_decr or (Hp == Hn).all(): H = Hp; break; else: alpha = alpha/beta; Hp = Hn; if iter == maxiter: print 'Max iter in nlssubprob' return (H, grad, iter) """ *********** Example Usage *********** import newsfeatures as newsfeatures import welcome_to_the_nmf as new_nmf from numpy import * allw,artw,artt=newsfeatures.getarticlewords() wordmatrix,wordvec=newsfeatures.makematrix(allw,artw) v=matrix(wordmatrix) weights,feat=new_nmf.factorize(v,pc=20,maxiter=5000) topp,pn=newsfeatures.showfeatures(weights,feat,artt,wordvec,out='todays_features.txt') newsfeatures.showarticles(artt,topp,pn,out='todays_articles.txt') ------------------------------------------------------------------------------- ***************************************** Variables in line 20: have fun with them :) ***************************************** wo,ho = alt_nmf.nmf(v, w, h, tol, timelimit, iter) v is the matrix converted to an array w and h are the two randomly generated factors of v converted to arrays tol is the degree of error allowed before stopping timelimit is the time allowed before stopping iter the iterations to perform before stopping The usage of factorize() is the same as using factorize in Chapter 10 but about 15/20 times faster and it appears more accurate too :D Note: can be optimized further """	Lola Montes	Sep 04, 2013
Printed	Page 261 3rd line from bottom of page	isinstance(t, node): should read hasattr(t, 'children'): The former does not function correctly (under Python 2.5.2 at any rate).	Marisano James	Oct 30, 2009
Printed, PDF, ePub, Mobi	Page 261 def selectindex():	As some wrote, function selectindex() can return value that is greater or equal to popsize. This results an error in caller function. I think this solves the problem: # Add uniform function to import from random from random import random, randint, choice, uniform # Compute constants once xmin = pexp**(popsize-2) den = log(pexp) def selectindex(): """ xmin prevents functionf from risisng error by restricting rand. value """ return int(log(uniform(xmin, 1.0)) / den)	Karol Olszanowski	Aug 19, 2016
Printed	Page 263 4th line from the bottom of the page	The line if isinstance(t1, node) and isinstance(t2, node): should read if hasattr(t1, 'children') and hasattr(t2, 'children'): The former does not allow the program to converge to the correct solution. [The latter code snippet also appears in the book's source code as distributed in PCI_Code.zip.]	Marisano James	Oct 30, 2009
Printed	Page 265 def selectindex():	Function selectindex() can return value that is greater or equal to popsize. This results an error in caller function.	Vladimir Aksenov	Dec 21, 2011
Printed	Page 271 toward the bottom of the tournament function	As currently presented, the when there is a tie between players i and j it is as if player i lost. Presently the code reads, elif winner==-1: losses[i]+=1 losses[i]+=1 pass However, it should read, elif winner==-1: losses[i]+=1 losses[j]+=1 pass Applying this change (replacing the second losses[i] with losses[j]) will assist in evolving improved game-playing AI. [The above error appears in the online code, PCI_code.zip, as well.]	Marisano James	Oct 30, 2009
Printed	Page 313 9th line	The command should be $ tar xvf numpy-1.0.2.tar instead of $ tar xvf numpy-1.0.2.tar.gz	Anonymous	Oct 08, 2008
Printed	Page 317 In def pearson, top 1/3 of page	The Pearson correlation coefficient error that anonymous first identified on p. 13 also appears here. The line reading, n=len(x) should read n=float(len(x)) to avoid integer roundoff errors in a subsequent division [i.e. in num=pSum-(sumx*sumy/n)].	Marisano James	Apr 26, 2010
	322 in the function veclength	In veclength, the coefficients a[i] need to be squared: def veclength(a): return sum([a[i]2 for i in range(len(a))]).5 to properly compute the length.	Anonymous	Sep 03, 2012
ePub	Page 2425 (Kindle) United States	In the gettextonly method in the FTS chapter: the subtext is bound to None sometimes raising exceptions and causing the indexing to fail. Handle that via a None check for subtext: if subtext!=None: resulttext+=subtext+'\n' return resulttext	H kelkar	Apr 08, 2013
ePub	Page 2434 (kindle) United States	In the FTS chapter there is a separatewords function [s.lower() for s in splitter.split(text) if s!=''] encounters non strings sometimes Handle that by placing it within a try except block try: statement except: print "statement" continue Not hanadling that case will cause indexing to fail.	Anonymous	Apr 08, 2013