The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction will be displayed in the column titled "Date Corrected".
The following errata were submitted by our customers and approved as valid errors by the author or editor.
| Version | Location | Description | Submitted By | Date submitted | Date corrected |
|
Page Chapter 12, Section 3: Training No Reward Model
Figure 12-32 |
At the very bottom of the figure, it says, "Increase likelihood of rejected generation." Shouldn't it say "decrease" instead of "increase," as we are trying to optimize a trainable model to have a lower likelihood of generating rejected samples?
Note from the Author or Editor: This is a typo and should indeed be "decrease" instead of "increase".
|
Anonymous |
Sep 17, 2024 |
|
|
Page Chapter 1, Section "Interfacing with Large Language Models", subsection "Open Models", "NOTE" box
"NOTE" Box, second sentence |
The sentence misstates the meaning of "permissive" by stating that it means a model "cannot" be used for X purposes. The opposite is true, e.g.:
- a permissive license means a model "can" be used for X purposes
- a restrictive license means a model "cannot" be used for X purposes
Original text:
"For instance, some publicly shared models have a permissive commercial license, which means that the model cannot be used for commercial purposes."
Note from the Author or Editor: This is indeed a typo and should have been "restrictive" instead of "permissive". This is most likely an artifact of an earlier version of the book.
|
Aaron Carver |
Sep 19, 2024 |
|
|
Page Choosing a Single Token from the Probability Distribution (Sampling/Decoding)
6th paragraph (1-indexing) |
In the code example where the prompt "The capital of France is" is tokenized and passed through a language model, the textbook states that the expected shape of the lm_head_output tensor is [1, 6, ...] – and I think this may be incorrect.
Since the prompt tokenizes into 5 tokens, the actual shape of `lm_head_output` should be [1, 5, ...], as the model hasn't yet predicted the next token. The model provides outputs for each input token without generating additional tokens unless explicitly programmed to do so.
When checking the most probable next token, the code does so by checking the last position ([-1]), but that index corresponds to 4, not 5.
Also, I think that adding the following code snippet would greatly enhance this example:
```python
[print(f"{tokenizer.decode(input_ids[0][:i+1])} -> {tokenizer.decode(t_pred_next.argmax(-1))}")
 for i, t_pred_next in enumerate(lm_head_output[0])];
```
```
The -> code
The capital -> of
The capital of -> the
The capital of France -> is
The capital of France is -> Paris
```
As it makes clear what's generated and what's "only yet predicted" at each step.
Let me know your thoughts on my reasoning – looking forward to your answer! Thanks for all the amazing content and knowledge shared with the community over the past years.
Note from the Author or Editor: Correct, in the example on page 80, the shape of the vector should be [1, 5, 32064] instead of [1, 6, 32064]. Thank you!
Thank you for submitting this and the explanatory code idea! It's a good example!
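As a quick way to double-check the corrected shape, here is a minimal sketch (assuming the chapter's Phi-3 setup; the exact loading code in the book may differ slightly):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the Phi-3 model used in the chapter's examples
model_name = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
print(input_ids.shape)       # number of input tokens (5 per this erratum)

lm_head_output = model(input_ids).logits
print(lm_head_output.shape)  # [1, <number of input tokens>, vocab_size], i.e. [1, 5, 32064] here

# The prediction for the next token sits at the last input position ([-1]).
next_token_id = lm_head_output[0, -1].argmax(-1)
print(tokenizer.decode(next_token_id))
```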
|
María Benavente |
Sep 22, 2024 |
|
|
Page Page 49
Final Bullet Point |
The final bullet point on the page is a duplicate of the preceding bullet point. This was confusing since the bullet is fairly long and I thought I was missing a distinction between the two.
Note from the Author or Editor: Confirmed duplicate bullet point that should be deleted. Thanks for submitting this!
|
Evan Oman |
Oct 26, 2024 |
|
|
Page https://learning.oreilly.com/library/view/hands-on-large-language/9781098150952/ch03.html#:-:text=Th
4th bullet in the bullet list at the end of the specified section |
This is a very, very minor suggestion. The text of interest is:
"Each of these Transformer blocks includes an attention layer and a feedforward neural network (also known as an mlp or multilevel perceptron). We’ll cover these in more detail later in the chapter."
I believe that an MLP is more commonly known as a multilayer perceptron rather than a multilevel perceptron. For instance, Wikipedia refers to it as a multilayer perceptron, and the top search results for a Google search for "multilevel perceptron" are results for "multilayer perceptron".
Thank you for your consideration of this feedback.
Note from the Author or Editor: Suggestions are never too minor!
Agreed, that should be "multilayer" instead of "multilevel".
|
Alvin G. |
Dec 14, 2024 |
|
|
Page Page 19
Figure 1-22 |
"I am llamas", should be "I love llamas" as on previous examples
Note from the Author or Editor: In the image, "am" should indeed be "love" instead.
|
Dmytro Nikolaiev |
Dec 23, 2024 |
|
|
Page Page 79
Fourth bullet point |
MLP is interpreted as "multilevel perceptron"; it should be "multilayer perceptron"
Note from the Author or Editor: Agreed, that should be "multilayer" instead of "multilevel".
|
Dmytro Nikolaiev |
Dec 23, 2024 |
|
|
Page Page 201
Paragraph before the tip |
"For now, it's important to know that we will use an 8-bit variant of Phi-3" but the code example uses 16-bit version
Note from the Author or Editor: 8-bit should indeed be changed to 16-bit.
|
Dmytro Nikolaiev |
Dec 23, 2024 |
|
|
Page Page 299
In the middle of the page |
"during evaluation" is mentioned in description of both per_device_train_batch_size and per_device_eval_batch_size
Note from the Author or Editor: "during evaluation" should indeed be "during training" for the `per_device_train_batch_size`.
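For reference, a minimal sketch of the two arguments and what each controls (the batch-size values here are illustrative, not taken from the book):
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="model",
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size per device during evaluation
)
```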
|
Dmytro Nikolaiev |
Dec 23, 2024 |
|
|
Page Page 339
After the code snippet |
Mentions "F1 score 0.85" while the code shows "0.8363", which should be rounded to "0.84"
Note from the Author or Editor: This is indeed a rounding error and should be 0.84 instead of 0.85.
|
Dmytro Nikolaiev |
Dec 23, 2024 |
|
|
Page Pages 340-341
Figures 11-14 and 11-15 |
Figures 11-14 and 11-15 contain "[CLS] What a horrible movie [MASK]!", should be "[CLS] What a horrible [MASK]!"
Note from the Author or Editor: That is correct, and we indeed should remove the word "movie" here.
|
Dmytro Nikolaiev |
Dec 23, 2024 |
|
|
Page Page 370
lora_alpha parameter description |
"A rule of thumb is to choose a value twice the size of r" but the value is half the r instead
Note from the Author or Editor: The rule of thumb is indeed correct but the code should be updated.
Instead of:
```python
# Prepare LoRA Configuration
peft_config = LoraConfig(
    lora_alpha=32,      # LoRA Scaling
    lora_dropout=0.1,   # Dropout for LoRA Layers
    r=64,               # Rank
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=     # Layers to target
        ['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)
```
It should be instead:
```python
# Prepare LoRA Configuration
peft_config = LoraConfig(
    lora_alpha=128,     # LoRA Scaling
    lora_dropout=0.1,   # Dropout for LoRA Layers
    r=64,               # Rank
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=     # Layers to target
        ['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)
```
Another option would be to decrease r to 32 and set lora_alpha to 64, but that would require additional testing.
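As a quick sanity check of the rule of thumb: in the PEFT library, the LoRA update is scaled by lora_alpha / r, so the corrected configuration gives a scaling factor of 2 rather than 0.5:
```python
# Scaling factor applied to the LoRA update in PEFT
r, lora_alpha = 64, 128
print(lora_alpha / r)  # 2.0 (with the original lora_alpha=32 this would be 0.5)
```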
|
Dmytro Nikolaiev |
Dec 23, 2024 |
|
|
Page Page 17, Figure 1-19
Figure 1-19 |
In the figure, the transformer encoder output is shown as "I am a student" while the decoder output is something different: "I love llamas". This was supposed to be a translation transformer from English to Dutch according to the flow of the chapter leading up to this point. This is confusing and needs a re-check for a possible error.
Note from the Author or Editor: In Figure 1-19, "I am a student" should be "Ik houd van lama's" instead.
|
Rajaram |
Dec 24, 2024 |
|
|
Page Chapter 5
Figure 5.14 |
In the numerator of Figure 5-14, the “average frequency of all words across all clusters” might be incorrect. We have 3 clusters, and the total word frequency across all clusters is:
(9 + 9 + 7 + 22 + 1 + 9 + 1) = 58
so:
A = 58 / 3 ~= 19.3
Note from the Author or Editor: This should indeed be 19.3 instead of the shown 29.
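A quick worked check of the corrected value, using the frequencies the submitter read from Figure 5-13:
```python
frequencies = [9, 9, 7, 22, 1, 9, 1]
A = sum(frequencies) / 3  # 58 total, divided by the 3 clusters
print(round(A, 1))        # 19.3
```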
|
Davide Restivo |
Feb 08, 2025 |
|
|
Page Chapter 6
Figure 6-1 |
The Llama 2 model description is missing a '/'
Wrong: 7B/13B70B
Correct: 7B/13B/70B
Note from the Author or Editor: That is indeed a typo and should be updated.
|
Davide Restivo |
Feb 09, 2025 |
|
|
Page Figure 3-5
Figure 3-5 |
Figure 3-5. The tokenizer has a vocabulary of 50,000 tokens. The model has token embeddings associated with those embeddings.
Should be
Figure 3-5. The tokenizer has a vocabulary of 50,000 tokens. The model has token embeddings associated with those **tokens**.
Note from the Author or Editor: Agreed. Thank you!
|
Anonymous |
Feb 09, 2025 |
|
|
Page Page 127, Section name: Text classification with generative models
first paragraph |
These models take as input some text and generative text and are thereby aptly named sequence-to-sequence models.
"generative" is a typo, should be generate:
These models take as input some text and generate text and are thereby aptly named sequence-to-sequence models.
Note from the Author or Editor: That is correct. It should indeed be "generate" instead of "generative".
|
Du Liu |
Mar 08, 2025 |
|
|
Page Page 359, Parameter-Efficient Fine-Tuning (PEFT)
3rd paragraph |
In Paragraph 3, you say, "... fine-tuning 3.6% of the parameters of BERT for a task can yield comparable performance to fine-tuning all the model's weights.". This is somewhat incorrect since the Adapters paper adds *new* weights that are trained and whose number is 3.6% of the original model's weights. It is not fine-tuning 3.6% of the parameters of the original BERT model.
Note from the Author or Editor: Thank you for the errata!
That is indeed correct. It should be made clearer that it is about adding parameters (which are derived from the original model). So, although the 3.6% is still accurate, additional text is needed to make clear that these parameters are not only derived from the original model but also compressed.
|
Sameer Indarapu |
Mar 21, 2025 |
|
|
Page 30
Last 2 paragraphs |
In the Open Models section, it might be worth further emphasizing the difference between open weight access (which still limits transparency and reproducibility) vs. fully open source access (providing the full codebase and information required for retraining the model from scratch).
I believe this distinction should be highlighted as they have very different implications for AI research and responsible development (having access just to the weights limits thorough investigation of training dynamics, biases, and ethical concerns).
As an example of a fully open source model, it might be worth citing OLMo, which provides unrestricted access to the training dataset, codebase, weights, evaluation suite, and intermediate checkpoints.
Note from the Author or Editor: Great feedback. We initially focused on demonstrating open weights vs. closed weights (so only the model is potentially shared) since that covered 90% of use cases. In practice, it seldom happens that a model's code (and especially its datasets) is shared alongside its parameters. As a result, people most often talk about sharing a model's parameters as "open source" (even though that strictly does not cover it).
For an updated version, I fully agree with further highlighting that mentions of "open source" in the community typically mean "open weights" and not shared code. That usage especially doesn't do justice to initiatives that are truly "open source", like the OLMo model that the reader mentioned.
Likewise, although DeepSeek did not release the training data (and as such is not fully "open source"), they did release many libraries for inference and training that go way beyond just "open weights".
It would be nice to add this distinction to an updated version, alongside the grey area that exists between "open weights" and "open source" (also considering different types of licenses, see MIT/Apache vs. more restrictive licenses).
|
Diego Carpintero |
Nov 24, 2024 |
|
|
Page 49
GPT-2 Colorized Token Output |
The example output does not match the actual output when running the code. For example: the spaces/tabs do not have multiple tokens.
Note from the Author or Editor: Good catch. The colored tokens should show spaces inside the quotation marks as in the notebook (https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter02/Chapter%202%20-%20Tokens%20and%20Token%20Embeddings.ipynb#scrollTo=89JpLpZmlhXr) for the gpt2 example. Thank you!
|
Sean Moreno |
Feb 19, 2025 |
|
|
Page 49
Final bullet point |
Is this point conflating 'tab' and 'space'?
Note from the Author or Editor: Correct.
"and the four spaces are represented as three tokens" should say instead:
"and the three tabs are represented as three tokens"
Thank you!
|
Sean Moreno |
Feb 19, 2025 |
|
|
Page 51
final bullet point under the section "GPT-4(2023)" |
The final bullet point under the section "GPT-4(2023)" says "Refer back to what we said about the GPT-2 tokenizer with regards to the L tokens".
But the "GPT-2(2019)" section (on page 49) doesn't say anything about "L tokens", instead it has a duplicated bullet point. The 5th bullet point duplicates the 4th bullet point.
Note from the Author or Editor: This is referring to the emoji and Chinese character. The correction here is that instead of the Ł symbol that appears in the book, it should show the music note emoji and 鸟 (the Chinese character). Thank you!
|
Du Liu |
Feb 16, 2025 |
|
|
Page 57
2nd paragraph |
"Instead of representing each token or word with a static vector, language models create contextualized word embeddings (shown in Figure 2-8) that represent a word with a different token based on its context."
I think the tokenization locks in the token ID; it is the embedding that is not static but dependent on context. Therefore the text would be correct and clearer as:
Instead of representing each token ID with a static vector, language models create contextualized embeddings (shown in Figure 2-8) that represent a token ID with a different vector based on its context.
Note from the Author or Editor: Agree that it's more accurate if the sentence said "embedding" instead of "token":
>"... that represent a word with a different _embedding_ based on its context."
|
Jakob Riishede Møller |
Jan 12, 2025 |
|
|
Page 59
3rd paragraph |
In this code example, there appears to be a mismatch mixing different model versions in the AutoTokenizer and AutoModel. Looking at the papers, the vocab size of 'deberta-v3' is 128K whereas 'deberta-base' was 50K, and the internal architectures also have some differences. This might lead to issues.
A tentative fix might be to use 'microsoft/deberta-v3-base' as the tokenizer:
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
instead of:
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
```python
## See Chapter 2: Tokens and Embeddings, page 59
from transformers import AutoModel, AutoTokenizer

# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")

# Load a language model
model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")

# Tokenize the sentence
tokens = tokenizer('Hello world', return_tensors='pt')

# Process the tokens
output = model(**tokens)[0]
```
Note from the Author or Editor: Agreed. Nice catch! We do need to change the tokenizer. Thank you!
|
Diego Carpintero |
Nov 24, 2024 |
|
|
Page 120
Figure 4-9 |
Accuracy formula: shouldn't the divisor be TP + TN + FP + FN?
Note from the Author or Editor: That is indeed a typo.
The image shows `TP + TN + FP + FP`
But it should be instead `TP + TN + FP + FN`
Note the last value (`FP`); it should be `FN` instead.
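For reference, the corrected formula as a small sketch:
```python
def accuracy(tp, tn, fp, fn):
    # Corrected denominator: TP + TN + FP + FN (not TP + TN + FP + FP)
    return (tp + tn) / (tp + tn + fp + fn)
```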
|
Marcel Neuhausler |
Nov 04, 2024 |
|
|
Page 146
1st Code box |
On page 145, the code is currently written as:
```
to_plot = df.loc[df.cluster != "-1", :]
outliers = df.loc[df.cluster == "-1", :]
```
However, I believe the correct code should be:
```
clusters_df = df.loc[df.cluster != "-1", :]
outliers_df = df.loc[df.cluster == "-1", :]
```
This adjustment aligns with the code provided on your official GitHub repository, ensuring consistency.
Note from the Author or Editor: This was indeed a typo, most likely a leftover from earlier versions. Considering we updated this in the official GitHub repository, it should indeed be updated to:
```python
clusters_df = df.loc[df.cluster != "-1", :]
outliers_df = df.loc[df.cluster == "-1", :]
```
|
Junpyo Lee |
Nov 11, 2024 |
|
|
Page 150
Figure 5-14 |
Should the numerator of the first term of the IDF log calculation be 2.8, not 29? 2.8 = sum of frequencies in Figure 5-13 divided by the count of frequencies = 58/21
Note from the Author or Editor: It's the other way around, it should be 58/3 (number of clusters) instead (see related errata).
|
Michael Shearer |
Dec 23, 2024 |
|
|
Page 170
First Paragraph - Controlling Model Output |
The paragraph says "In our previous example, you might have noticed.....we used several parameters...including temperature and top_p.".
But this is not correct. The previous example does not use temperature or top_p.
Note from the Author or Editor: The previous example indeed does not use `temperature` or `top_p`, so this sentence can indeed be removed.
|
RAJARAM |
Jan 10, 2025 |
|
|
Page 202
in the code section commented below # Select outliers and non-outliers (clusters) |
The variable names used to save the outliers and non-outliers, which are then used in the next section of the code for plotting, should be `outliers_df` and `clusters_df`. Instead of those, the variable names `to_plot` and `outliers` are used.
Note from the Author or Editor: This is correct.
On page 145 of the book, there are two lines of code that create the `to_plot` and `outliers` variables. The names of these variables should be `clusters_df`and `outliers_df`respectively.
|
Dharmendra Rajak |
Jan 24, 2025 |
|
|
Page 244
Third paragraph |
Reference to "Pretrained transformers for text tanking: BERT and beyond..." should be "Pretrained transformers for text ranking: BERT and beyond..."
'Ranking' is spelt 'Tanking'
Note from the Author or Editor: Correct, definitely should be "ranking".
"text tanking" does sound like a great name for a technique ;)
|
Liam Bluett |
Dec 09, 2024 |
|
|
Page 252,253
code snippet |
Page 252, last paragraph of the book (printed version) reads: "we'll start by downloading a quantized model", but in the code snippet it uses the model "Phi-3-mini-4k-instruct-fp16.gguf" as follows:
```python
!wget [...]Phi-3-mini-4k-instruct-fp16.gguf

llm = LlamaCpp(
    model_path="Phi-3-mini-4k-instruct-fp16.gguf",
    n_gpu_layers=-1,
    max_tokens=500,
    n_ctx=2048,
    seed=42,
    verbose=False
)
```
- "Phi-3-mini-4k-instruct-fp16.gguf" is, according to the model card, *not* a quantized version, but a full-precision 16-bit version.
- The quantized version for this model is listed in the card as: "Phi-3-mini-4k-instruct-q4.gguf", a 4-bit quantized version.
Note from the Author or Editor: That is indeed a typo as we expected to use the 4-bit quantized version. So instead of updating the text, we updated the code instead to reflect that in the official GitHub repository.
Therefore, we should update it to `model_path="Phi-3-mini-4k-instruct-q4.gguf"` instead (which was done in the repo).
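Put together, the corrected snippet (mirroring the repository update; the wget URL stays elided as in the original and would presumably point to the q4 file as well) would look like:
```python
!wget [...]Phi-3-mini-4k-instruct-q4.gguf

llm = LlamaCpp(
    model_path="Phi-3-mini-4k-instruct-q4.gguf",  # 4-bit quantized version
    n_gpu_layers=-1,
    max_tokens=500,
    n_ctx=2048,
    seed=42,
    verbose=False
)
```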
|
Diego Carpintero |
Dec 14, 2024 |
|
|
Page 253
"Loading the embedding model" section |
The authors say they're going to use the "BAAI/bge-small-en-v1.5" model but in their code snippet they use "model_name='thenlper/gte-small'"
Note from the Author or Editor: The code snippet is where the update should take place. There, the `model_name` should be `BAAI/bge-small-en-v1.5`. This was already updated in the GitHub repo to reflect the text.
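For reference, a minimal sketch of the corrected loading code, assuming the LangChain HuggingFaceEmbeddings wrapper used in that chapter (the exact import path in the book may differ):
```python
from langchain.embeddings import HuggingFaceEmbeddings

# Corrected model name, matching the text
embedding_model = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
```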
|
Liam Bluett |
Oct 22, 2024 |
|
|
Page 297
2nd Code box |
On page 297, the code currently appears as:
dataset[2]
Since train_dataset was defined earlier in the code, the correct reference should be:
train_dataset[2]
This change ensures consistency with the previous variable definition and the official GitHub repository.
Note from the Author or Editor: Correct, `dataset[2]` should be `train_dataset[2]` instead.
This was updated in the GitHub repo to reflect this.
|
Peter Kulisek |
Dec 11, 2024 |
|
| Printed |
Page 300
The tip box |
On page 300 of the printed book, you will find a tip box mentioning the batch size in MNR loss. This is the same tip box as on page 307 and the former can be safely ignored.
Do note, though, that larger batch sizes in general also help speed up training, especially if you have enough VRAM. Likewise, the tip does not relate only to MNR losses but to more losses that share similar mechanics. In other words, upping the batch size is seldom a bad idea if your device can handle it.
|
Maarten Grootendorst |
Nov 13, 2024 |
|
|
Page 313
Third sentence |
The authors state "we use the remaining 400,000 sentence pairs (from our original dataset of 50,000 sentence pairs)". This should be 40,000 not 400,000.
Note from the Author or Editor: Correct, this is indeed a typo and should be "40,000" instead of "400,000".
|
Liam Bluett |
Oct 22, 2024 |
|
| Printed |
Page 338
bottom |
On the bottom of the page, it mentions there are 20*32=680 samples. This should be 640 samples instead.
The same applies to 680*2= 1280. Here it should also be 640 instead of 680.
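As a quick check of the corrected numbers:
```python
print(20 * 32)   # 640, not 680
print(640 * 2)   # 1280
```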
|
Maarten Grootendorst |
Dec 10, 2024 |
|
|
Page 341
Footnote |
"How to fine-tune GERT for text classification?"
should be
"How to Fine-Tune BERT for Text Classification?"
BERT is spelt GERT, typo.
Note from the Author or Editor: That's an interesting typo. That is a citation copied over from Google Scholar, I believe, which obviously contains BERT and not GERT. On closer inspection, the footnote seems to have been updated manually in later stages of the book's development, so perhaps it was a result of manually capitalizing the word.
Either way, should be BERT.
|
Liam Bluett |
Dec 05, 2024 |
|
|
Page 364
Second paragraph |
Two 12,288 x 2 matrices should be x 8
Note from the Author or Editor: There are actually two mistakes here:
* 12,288 x 2 should be 12,288 x 8
* 197K should be 98K
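As a worked check of the arithmetic behind the corrected figure: a single 12,288 x 8 matrix holds 12,288 * 8 values, i.e. roughly 98K:
```python
print(12_288 * 8)  # 98304, ≈ 98K
```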
|
Michael Shearer |
Dec 27, 2024 |
|