The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction will be displayed in the column titled "Date Corrected".
The following errata were submitted by our customers and approved as valid errors by the author or editor.
| Version | Location | Description | Submitted By | Date submitted | Date corrected |
|
Page Chapter 12, Section 3: Training No Reward Model
Figure 12-32 |
At the very bottom of the figure, it says, "Increase likelihood of rejected generation." Shouldn't it say "decrease" instead of "increase," as we are trying to optimize a trainable model to have a lower likelihood of generating rejected samples?
Note from the Author or Editor: This is a typo and should indeed be "decrease" instead of "increase".
|
Anonymous |
Sep 17, 2024 |
|
|
Page Chapter 1, Section "Interfacing with Large Language Models", subsection "Open Models", "NOTE" box
"NOTE" Box, second sentence |
The sentence misstates the meaning of "permissive" by stating that it means a model "cannot" be used for X purposes. The opposite is true, e.g.:
- a permissive license means a model "can" be used for X purposes
- a restrictive license means a model "cannot" be used for X purposes
Original text:
"For instance, some publicly shared models have a permissive commercial license, which means that the model cannot be used for commercial purposes."
Note from the Author or Editor: This is indeed a typo and should have been "restrictive" instead of "permissive". This is most likely an artifact of an earlier version of the book.
|
Aaron Carver |
Sep 19, 2024 |
|
|
Page Choosing a Single Token from the Probability Distribution (Sampling/Decoding)
6th paragraph (1-indexing) |
In the code example where the prompt "The capital of France is" is tokenized and passed through a language model, the textbook states that the expected shape of the lm_head_output tensor is [1, 6, ...] – and I think this may be incorrect.
Since the prompt tokenizes into 5 tokens, the actual shape of `lm_head_output` should be [1, 5, ...], as the model hasn't yet predicted the next token. The model provides outputs for each input token without generating additional tokens unless explicitly programmed to do so.
When checking the most probable next token, the code does so by checking the last position ([-1]), but that index corresponds to 4, not 5.
Also, I think that adding the following code snippet would greatly enhance this example:
```python
[print(f"{tokenizer.decode(input_ids[0][:i+1])} -> {tokenizer.decode(t_pred_next.argmax(-1))}")
 for i, t_pred_next in enumerate(lm_head_output[0])];
```
```
The -> code
The capital -> of
The capital of -> the
The capital of France -> is
The capital of France is -> Paris
```
As it makes clear what's generated and what's "only yet predicted" at each step.
Let me know your thoughts on my reasoning – looking forward to your answer! Thanks for all the amazing content and knowledge shared with the community over the past years.
Note from the Author or Editor: Correct, in the example on page 80, the shape of the vector should be [1, 5, 32064] instead of [1, 6, 32064]. Thank you!
Thank you for submitting this and the explanatory code idea! It's a good example!
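As a quick way to double-check the corrected shape, here is a minimal sketch (assuming the chapter's Phi-3 setup; the exact loading code in the book may differ slightly):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the Phi-3 model used in the chapter's examples
model_name = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
print(input_ids.shape)       # number of input tokens (5 per this erratum)

lm_head_output = model(input_ids).logits
print(lm_head_output.shape)  # [1, <number of input tokens>, vocab_size], i.e. [1, 5, 32064] here

# The prediction for the next token sits at the last input position ([-1]).
next_token_id = lm_head_output[0, -1].argmax(-1)
print(tokenizer.decode(next_token_id))
```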
|
María Benavente |
Sep 22, 2024 |
|
|
Page Page 49
Final Bullet Point |
The final bullet point on the page is a duplicate of the preceding bullet point. This was confusing since the bullet is fairly long and I thought I was missing a distinction between the two.
Note from the Author or Editor: Confirmed duplicate bullet point that should be deleted. Thanks for submitting this!
|
Evan Oman |
Oct 26, 2024 |
|
|
Page https://learning.oreilly.com/library/view/hands-on-large-language/9781098150952/ch03.html#:-:text=Th
4th bullet in the bullet list at the end of the specified section |
This is a very, very minor suggestion. The text of interest is:
"Each of these Transformer blocks includes an attention layer and a feedforward neural network (also known as an mlp or multilevel perceptron). We’ll cover these in more detail later in the chapter."
I believe that an MLP is more commonly known as a multilayer perceptron rather than a multilevel perceptron. For instance, Wikipedia refers to it as a multilayer perceptron, and the top search results for a Google search for "multilevel perceptron" are results for "multilayer perceptron".
Thank you for your consideration of this feedback.
Note from the Author or Editor: Suggestions are never too minor!
Agreed, that should be "multilayer" instead of "multilevel".
|
Alvin G. |
Dec 14, 2024 |
|
|
Page Page 19
Figure 1-22 |
"I am llamas", should be "I love llamas" as on previous examples
Note from the Author or Editor: In the image, "am" should indeed be "love" instead.
|
Dmytro Nikolaiev |
Dec 23, 2024 |
|
|
Page Page 79
Fourth bullet point |
MLP is interpreted as "multilevel perceptron"; it should be "multilayer perceptron"
Note from the Author or Editor: Agreed, that should be "multilayer" instead of "multilevel".
|
Dmytro Nikolaiev |
Dec 23, 2024 |
|
|
Page Page 201
Paragraph before the tip |
"For now, it's important to know that we will use an 8-bit variant of Phi-3" but the code example uses 16-bit version
Note from the Author or Editor: 8-bit should indeed be changed to 16-bit.
|
Dmytro Nikolaiev |
Dec 23, 2024 |
|
|
Page Page 299
In the middle of the page |
"during evaluation" is mentioned in description of both per_device_train_batch_size and per_device_eval_batch_size
Note from the Author or Editor: "during evaluation" should indeed be "during training" for the `per_device_train_batch_size`.
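For reference, a minimal sketch of the two arguments and what each controls (the batch-size values here are illustrative, not taken from the book):
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="model",
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size per device during evaluation
)
```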
|
Dmytro Nikolaiev |
Dec 23, 2024 |
|
|
Page Page 339
After the code snippet |
Mentions "F1 score 0.85" while the code shows "0.8363", which should be rounded to "0.84"
Note from the Author or Editor: This is indeed a rounding error and should be 0.84 instead of 0.85.
|
Dmytro Nikolaiev |
Dec 23, 2024 |
|
|
Page Pages 340-341
Figures 11-14 and 11-15 |
Figures 11-14 and 11-15 contain "[CLS] What a horrible movie [MASK]!", should be "[CLS] What a horrible [MASK]!"
Note from the Author or Editor: That is correct, and we indeed should remove the word "movie" here.
|
Dmytro Nikolaiev |
Dec 23, 2024 |
|
|
Page Page 370
lora_alpha parameter description |
"A rule of thumb is to choose a value twice the size of r" but the value is half the r instead
Note from the Author or Editor: The rule of thumb is indeed correct but the code should be updated.
Instead of:
```python
# Prepare LoRA Configuration
peft_config = LoraConfig(
    lora_alpha=32,      # LoRA Scaling
    lora_dropout=0.1,   # Dropout for LoRA Layers
    r=64,               # Rank
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=     # Layers to target
        ['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)
```
It should be instead:
```python
# Prepare LoRA Configuration
peft_config = LoraConfig(
    lora_alpha=128,     # LoRA Scaling
    lora_dropout=0.1,   # Dropout for LoRA Layers
    r=64,               # Rank
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=     # Layers to target
        ['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)
```
Another option would be to decrease r to 32 and set lora_alpha to 64, but that would require additional testing.
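As a quick sanity check of the rule of thumb: in the PEFT library, the LoRA update is scaled by lora_alpha / r, so the corrected configuration gives a scaling factor of 2 rather than 0.5:
```python
# Scaling factor applied to the LoRA update in PEFT
r, lora_alpha = 64, 128
print(lora_alpha / r)  # 2.0 (with the original lora_alpha=32 this would be 0.5)
```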
|
Dmytro Nikolaiev |
Dec 23, 2024 |
|
|
Page Page 17, Figure 1-19
Figure 1-19 |
In the figure, the transformer encoder output is shown as "I am a student" while the decoder output is something different: "I love llamas". This was supposed to be a translation transformer from English to Dutch according to the flow of the chapter leading up to this point. This is confusing and needs a re-check for a possible error.
Note from the Author or Editor: In Figure 1-19, "I am a student" should be "Ik houd van lama's" instead.
|
Rajaram |
Dec 24, 2024 |
|
|
Page Chapter 5
Figure 5.14 |
In the numerator of Figure 5-14, the “average frequency of all words across all clusters” might be incorrect. We have 3 clusters, and the total word frequency across all clusters is:
(9 + 9 + 7 + 22 + 1 + 9 + 1) = 58
so:
A = 58 / 3 ~= 19.3
Note from the Author or Editor: This should indeed be 19.3 instead of the shown 29.
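A quick worked check of the corrected value, using the frequencies the submitter read from Figure 5-13:
```python
frequencies = [9, 9, 7, 22, 1, 9, 1]
A = sum(frequencies) / 3  # 58 total, divided by the 3 clusters
print(round(A, 1))        # 19.3
```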
|
Davide Restivo |
Feb 08, 2025 |
|
|
Page Chapter 6
Figure 6-1 |
The Llama 2 model description is missing a '/'
Wrong: 7B/13B70B
Correct: 7B/13B/70B
Note from the Author or Editor: That is indeed a typo and should be updated.
|
Davide Restivo |
Feb 09, 2025 |
|
|
Page Figure 3-5
Figure 3-5 |
Figure 3-5. The tokenizer has a vocabulary of 50,000 tokens. The model has token embeddings associated with those embeddings.
Should be
Figure 3-5. The tokenizer has a vocabulary of 50,000 tokens. The model has token embeddings associated with those **tokens**.
Note from the Author or Editor: Agreed. Thank you!
|
Anonymous |
Feb 09, 2025 |
|
|
Page Page 127, Section name: Text classification with generative models
first paragraph |
These models take as input some text and generative text and are thereby aptly named sequence-to-sequence models.
"generative" is a typo, should be generate:
These models take as input some text and generate text and are thereby aptly named sequence-to-sequence models.
Note from the Author or Editor: That is correct. It should indeed be "generate" instead of "generative".
|
Du Liu |
Mar 08, 2025 |
|
|
Page Page 359, Parameter-Efficient Fine-Tuning (PEFT)
3rd paragraph |
In Paragraph 3, you say, "... fine-tuning 3.6% of the parameters of BERT for a task can yield comparable performance to fine-tuning all the model's weights.". This is somewhat incorrect since the Adapters paper adds *new* weights that are trained and whose number is 3.6% of the original model's weights. It is not fine-tuning 3.6% of the parameters of the original BERT model.
Note from the Author or Editor: Thank you for the errata!
That is indeed correct. It should be made clearer that it is about adding parameters (which are derived from the original model). So, although the 3.6% is still accurate, additional text is needed to make clear that these parameters are not only derived from the original model but also compressed.
|
Sameer Indarapu |
Mar 21, 2025 |
|
|
Page 30
Last 2 paragraphs |
In the Open Models section, it might be worth further emphasizing the difference between open weight access (which still limits transparency and reproducibility) vs. fully open source access (providing the full codebase and information required for retraining the model from scratch).
I believe this distinction should be highlighted as they have very different implications for AI research and responsible development (having access just to the weights limits thorough investigation of training dynamics, biases, and ethical concerns).
As an example of a fully open source model, it might be worth citing OLMo, which provides unrestricted access to the training dataset, codebase, weights, evaluation suite, and intermediate checkpoints.
Note from the Author or Editor: Great feedback. We initially focused on demonstrating open weights vs. closed weights (so only the model is potentially shared) since that covered 90% of use cases. In practice, it seldom happens that a model's code (and especially its datasets) is shared alongside its parameters. As a result, people most often talk about sharing a model's parameters as "open source" (even though that strictly does not cover it).
For an updated version, I fully agree with further highlighting that mentions of "open source" in the community typically mean "open weights" and not shared code. That usage especially doesn't do justice to initiatives that are truly "open source", like the OLMo model that the reader mentioned.
Likewise, although DeepSeek did not release the training data (and as such is not fully "open source"), they did release many libraries for inference and training that go way beyond just "open weights".
It would be nice to add this distinction to an updated version, alongside the grey area that exists between "open weights" and "open source" (also considering different types of licenses, see MIT/Apache vs. more restrictive licenses).
|
Diego Carpintero |
Nov 24, 2024 |
|
|
Page 49
GPT-2 Colorized Token Output |
The example output does not match the actual output when running the code. For example: the spaces/tabs do not have multiple tokens.
Note from the Author or Editor: Good catch. The colored tokens should show spaces inside the quotation marks as in the notebook (https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter02/Chapter%202%20-%20Tokens%20and%20Token%20Embeddings.ipynb#scrollTo=89JpLpZmlhXr) for the gpt2 example. Thank you!
|
Sean Moreno |
Feb 19, 2025 |
|
|
Page 49
Final bullet point |
Is this point conflating 'tab' and 'space'?
Note from the Author or Editor: Correct.
"and the four spaces are represented as three tokens" should say instead:
"and the three tabs are represented as three tokens"
Thank you!
|
Sean Moreno |
Feb 19, 2025 |
|
|
Page 51
final bullet point under the section "GPT-4(2023)" |
The final bullet point under the section "GPT-4(2023)" says "Refer back to what we said about the GPT-2 tokenizer with regards to the L tokens".
But the "GPT-2(2019)" section (on page 49) doesn't say anything about "L tokens", instead it has a duplicated bullet point. The 5th bullet point duplicates the 4th bullet point.
Note from the Author or Editor: This is referring to the emoji and Chinese character. The correction here is that instead of the Ł symbol that appears in the book, it should show the music note emoji and 鸟 (the Chinese character). Thank you!
|
Du Liu |
Feb 16, 2025 |
|
|
Page 57
2nd paragraph |
"Instead of representing each token or word with a static vector, language models create contextualized word embeddings (shown in Figure 2-8) that represent a word with a different token based on its context."
I think the tokenization locks in the token ID; it is the embedding that is not static but dependent on context. Therefore the text would be correct and clearer as:
Instead of representing each token ID with a static vector, language models create contextualized embeddings (shown in Figure 2-8) that represent a token ID with a different vector based on its context.
Note from the Author or Editor: Agree that it's more accurate if the sentence said "embedding" instead of "token":
>"... that represent a word with a different _embedding_ based on its context."
|
Jakob Riishede Møller |
Jan 12, 2025 |
|
|
Page 59
3rd paragraph |
In this code example, there appears to be a mismatch mixing different model versions in the AutoTokenizer and AutoModel. Looking at the papers, the vocab size of 'deberta-v3' is 128K whereas 'deberta-base' was 50K, and the internal architectures also have some differences. This might lead to issues.
A tentative fix might be to use 'microsoft/deberta-v3-base' as the tokenizer:
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
instead of:
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
```python
## See Chapter 2: Tokens and Embeddings, page 59
from transformers import AutoModel, AutoTokenizer

# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")

# Load a language model
model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")

# Tokenize the sentence
tokens = tokenizer('Hello world', return_tensors='pt')

# Process the tokens
output = model(**tokens)[0]
```
Note from the Author or Editor: Agreed. Nice catch! We do need to change the tokenizer. Thank you!
|
Diego Carpintero |
Nov 24, 2024 |
|
|
Page 120
Figure 4-9 |
Accuracy formula: shouldn't the divisor be TP + TN + FP + FN?
Note from the Author or Editor: That is indeed a typo.
The image shows `TP + TN + FP + FP`
But it should be instead `TP + TN + FP + FN`
Note the last value (`FP`); it should be `FN` instead.
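For reference, the corrected formula as a small sketch:
```python
def accuracy(tp, tn, fp, fn):
    # Corrected denominator: TP + TN + FP + FN (not TP + TN + FP + FP)
    return (tp + tn) / (tp + tn + fp + fn)
```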
|
Marcel Neuhausler |
Nov 04, 2024 |
|
|
Page 146
1st Code box |
On page 145, the code is currently written as:
```
to_plot = df.loc[df.cluster != "-1", :]
outliers = df.loc[df.cluster == "-1", :]
```
However, I believe the correct code should be:
```
clusters_df = df.loc[df.cluster != "-1", :]
outliers_df = df.loc[df.cluster == "-1", :]
```
This adjustment aligns with the code provided on your official GitHub repository, ensuring consistency.
Note from the Author or Editor: This was indeed a typo, most likely a leftover from earlier versions. Considering we updated this in the official GitHub repository, it should indeed be updated to:
```python
clusters_df = df.loc[df.cluster != "-1", :]
outliers_df = df.loc[df.cluster == "-1", :]
```
|
Junpyo Lee |
Nov 11, 2024 |
|
|
Page 150
Figure 5-14 |
Should the numerator of the first term of the IDF log calculation be 2.8, not 29? 2.8 = sum of frequencies in Figure 5-13 divided by the count of frequencies = 58/21
Note from the Author or Editor: It's the other way around, it should be 58/3 (number of clusters) instead (see related errata).
|
Michael Shearer |
Dec 23, 2024 |
|
|
Page 170
First Paragraph - Controlling Model Output |
The paragraph says "In our previous example, you might have noticed.....we used several parameters...including temperature and top_p.".
But this is not correct. The previous example does not use temperature or top_p.
Note from the Author or Editor: The previous example indeed does not use `temperature` or `top_p`, so this sentence can indeed be removed.
|
RAJARAM |
Jan 10, 2025 |
|
|
Page 202
in the code section commented below # Select outliers and non-outliers (clusters) |
The variable names used to save the outliers and non-outliers, which are then used in the next section of the code for plotting, should be `outliers_df` and `clusters_df`. Instead of those, the variable names `to_plot` and `outliers` are used.
Note from the Author or Editor: This is correct.
On page 145 of the book, there are two lines of code that create the `to_plot` and `outliers` variables. The names of these variables should be `clusters_df`and `outliers_df`respectively.
|
Dharmendra Rajak |
Jan 24, 2025 |
|
|
Page 244
Third paragraph |
Reference to "Pretrained transformers for text tanking: BERT and beyond..." should be "Pretrained transformers for text ranking: BERT and beyond..."
'Ranking' is spelt 'Tanking'
Note from the Author or Editor: Correct, definitely should be "ranking".
"text tanking" does sound like a great name for a technique ;)
|
Liam Bluett |
Dec 09, 2024 |
|
|
Page 252,253
code snippet |
Page 252, last paragraph of the book (printed version) reads: "we'll start by downloading a quantized model", but in the code snippet it uses the model "Phi-3-mini-4k-instruct-fp16.gguf" as follows:
```python
!wget [...]Phi-3-mini-4k-instruct-fp16.gguf

llm = LlamaCpp(
    model_path="Phi-3-mini-4k-instruct-fp16.gguf",
    n_gpu_layers=-1,
    max_tokens=500,
    n_ctx=2048,
    seed=42,
    verbose=False
)
```
- "Phi-3-mini-4k-instruct-fp16.gguf" is, according to the model card, *not* a quantized version, but a full-precision 16-bit version.
- The quantized version for this model is listed in the card as: "Phi-3-mini-4k-instruct-q4.gguf", a 4-bit quantized version.
Note from the Author or Editor: That is indeed a typo as we expected to use the 4-bit quantized version. So instead of updating the text, we updated the code instead to reflect that in the official GitHub repository.
Therefore, we should update it to `model_path="Phi-3-mini-4k-instruct-q4.gguf"` instead (which was done in the repo).
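Put together, the corrected snippet (mirroring the repository update; the wget URL stays elided as in the original and would presumably point to the q4 file as well) would look like:
```python
!wget [...]Phi-3-mini-4k-instruct-q4.gguf

llm = LlamaCpp(
    model_path="Phi-3-mini-4k-instruct-q4.gguf",  # 4-bit quantized version
    n_gpu_layers=-1,
    max_tokens=500,
    n_ctx=2048,
    seed=42,
    verbose=False
)
```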
|
Diego Carpintero |
Dec 14, 2024 |
|
|
Page 253
"Loading the embedding model" section |
The authors say they're going to use the "BAAI/bge-small-en-v1.5" model but in their code snippet they use "model_name='thenlper/gte-small'"
Note from the Author or Editor: The code snippet is where the update should take place. There, the `model_name` should be `BAAI/bge-small-en-v1.5`. This was already updated in the GitHub repo to reflect the text.
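For reference, a minimal sketch of the corrected loading code, assuming the LangChain HuggingFaceEmbeddings wrapper used in that chapter (the exact import path in the book may differ):
```python
from langchain.embeddings import HuggingFaceEmbeddings

# Corrected model name, matching the text
embedding_model = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
```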
|
Liam Bluett |
Oct 22, 2024 |
|
|
Page 297
2nd Code box |
On page 297, the code currently appears as:
dataset[2]
Since train_dataset was defined earlier in the code, the correct reference should be:
train_dataset[2]
This change ensures consistency with the previous variable definition and the official GitHub repository.
Note from the Author or Editor: Correct, `dataset[2]` should be `train_dataset[2]` instead.
This was updated in the GitHub repo to reflect this.
|
Peter Kulisek |
Dec 11, 2024 |
|
| Printed |
Page 300
The tip box |
On page 300 of the printed book, you will find a tip box mentioning the batch size in MNR loss. This is the same tip box as on page 307 and the former can be safely ignored.
Do note, though, that larger batch sizes in general also help speed up training, especially if you have enough VRAM. Likewise, the tip does not relate only to MNR losses but to more losses that share similar mechanics. In other words, upping the batch size is seldom a bad idea if your device can handle it.
|
Maarten Grootendorst |
Nov 13, 2024 |
|
|
Page 313
Third sentence |
The authors state "we use the remaining 400,000 sentence pairs (from our original dataset of 50,000 sentence pairs)". This should be 40,000 not 400,000.
Note from the Author or Editor: Correct, this is indeed a typo and should be "40,000" instead of "400,000".
|
Liam Bluett |
Oct 22, 2024 |
|
| Printed |
Page 338
bottom |
On the bottom of the page, it mentions there are 20*32=680 samples. This should be 640 samples instead.
The same applies to 680*2= 1280. Here it should also be 640 instead of 680.
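As a quick check of the corrected numbers:
```python
print(20 * 32)   # 640, not 680
print(640 * 2)   # 1280
```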
|
Maarten Grootendorst |
Dec 10, 2024 |
|
|
Page 341
Footnote |
"How to fine-tune GERT for text classification?"
should be
"How to Fine-Tune BERT for Text Classification?"
BERT is spelt GERT, typo.
Note from the Author or Editor: That's an interesting typo. That is a citation copied over from Google Scholar, I believe, which obviously contains BERT and not GERT. On closer inspection, the footnote seems to have been updated manually in later stages of the book's development, so perhaps it was a result of manually capitalizing the word.
Either way, should be BERT.
|
Liam Bluett |
Dec 05, 2024 |
|
|
Page 364
Second paragraph |
Two 12,288 x 2 matrices should be x 8
Note from the Author or Editor: There are actually two mistakes here:
* 12,288 x 2 should be 12,288 x 8
* 197K should be 98K
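As a worked check of the arithmetic behind the corrected figure: a single 12,288 x 8 matrix holds 12,288 * 8 values, i.e. roughly 98K:
```python
print(12_288 * 8)  # 98304, ≈ 98K
```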
|
Michael Shearer |
Dec 27, 2024 |
|