The errata list records errors found after the product was released, along with their corrections. If an error was corrected in a later version or reprint, the date of the correction is shown in the "Date corrected" column.
The following errata were submitted by our customers and approved as valid errors by the author or editor.
| Version |
Location |
Description |
Submitted By |
Date submitted |
Date corrected |
|
Page 25
2nd paragraph |
"In the GB200 superchip ... 480GB of LPDDR5X at up to ~500 TB/s".
The per-superchip CPU memory bandwidth is 512 GB/s, not TB/s; see the NVIDIA documentation (it seems I can't include the URL).
Note from the Author or Editor: Thanks for catching this. Yes, we should change this.
|
Charles Natali |
Jan 04, 2026 |
Jan 16, 2026 |
|
Page 71
2nd paragraph |
> Memory pinning speeds up ... because the CUDA driver doesn't have to pin pages on the fly.
This is a bit misleading, because it could imply that in the non-pinned case the runtime pins the pages on the fly.
That is not how it works: in that case it uses a pool of preallocated pinned staging/bounce buffers and does an intermediate copy. Avoiding this intermediate copy is the main reason for the speedup.
Note from the Author or Editor: Change sentence to:
Memory pinning speeds up the tensor.to(device) operations because it avoids the intermediate copy through the CUDA driver’s preallocated pinned staging bounce buffers that are typically used when the source is in pageable host memory.
|
Charles Natali |
Jan 04, 2026 |
Jan 16, 2026 |
|
Page 79
3rd paragraph |
The sentence "For instance, two training..." is repeated.
Note from the Author or Editor: @kristen: can you fix this when you get a chance? not urgent.
|
Charles Natali |
Jan 04, 2026 |
Jan 16, 2026 |
|
Pages 379 & 380, Chapter 10
second-to-last paragraph of Page 379, first paragraph of Page 380 |
These two paragraphs mention buffers `curr` and `next`, but there are no such names in the preceding code. If I understand it correctly, the code uses a variable `s = tile % STAGES` to select which of the two buffers to use.
Note from the Author or Editor: confirmed. fixing.
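As a rough illustration of the indexing this erratum describes, the sketch below rotates through the buffers with `s = tile % STAGES`. The function name `stage_schedule`, the prefetch index `(tile + 1) % STAGES`, and `STAGES = 2` (taken from "two buffers" above) are assumptions made for illustration; this is not the book's kernel code.

```python
STAGES = 2  # assumption: two buffers (double buffering), per the erratum text


def stage_schedule(num_tiles, stages=STAGES):
    """Return (tile, s, prefetch) triples for a rotating multi-stage buffer."""
    schedule = []
    for tile in range(num_tiles):
        s = tile % stages               # buffer holding the tile being consumed
        prefetch = (tile + 1) % stages  # hypothetical: buffer the next tile loads into
        schedule.append((tile, s, prefetch))
    return schedule


if __name__ == "__main__":
    for tile, s, prefetch in stage_schedule(4):
        print(f"tile {tile}: compute from buffer {s}, prefetch into buffer {prefetch}")
```

With two stages the indices simply alternate, which is presumably what the prose meant by `curr` and `next`.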
|
Liang He |
Feb 07, 2026 |
Mar 13, 2026 |
|
Pages 387, 388
the code snippet |
I didn't test the warp_specialized_pipeline_kernel code snippet thoroughly (I only tried a basic test), but I feel there may be a couple of issues.
One is that if we call cuda::make_pipeline without specifying a producer count, iiuc pipe.producer_acquire() and pipe.consumer_wait() will block until all threads in the current block reach that point. Since execution goes into different branches based on warp id, the program will stall indefinitely. (This turned out to be the case on my consumer PC, an RTX 4070 with the sm_89 arch.)
The other is that there are two pipe.consumer_wait() calls (for warps 1 and 2). iiuc we can't guarantee that warp 1 always consumes the output of warp 0 and warp 2 always consumes the output of warp 1, unless we insert __syncthreads() between the `if`s.
It's possible that I misunderstand something important, but I do feel confused here.
Note from the Author or Editor: Confirmed. I will update the github repo - and I will update the book, as well. We will patch this soon. Thanks for catching this! @Kristen: I have the updates ready for you.
|
Liang He |
Mar 01, 2026 |
Mar 13, 2026 |
|
Page 12
4 |
"For example, a 100-trillion-parameter model would require approximately 182 TB of
GPU memory (182 TB = 100 trillion parameters × 16 bits per weight × 8 bits per
byte) to load the model if each parameter is stored in 16-bit (2-byte) precision"
The error is in the formula: It should be 100 trillion parameters × 16 bits per weight ÷ 8 bits per byte (to convert bits to bytes), not × 8 bits per byte.
Note from the Author or Editor: Confirmed. Good catch!
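The corrected formula can be checked with a few lines of arithmetic. This is a minimal sketch; the helper name `model_size_bytes` is hypothetical, and reading the quoted 182 figure as binary units (TiB) is an assumption.

```python
BITS_PER_BYTE = 8


def model_size_bytes(num_params, bits_per_weight):
    """Bytes needed for the weights: parameters x bits per weight / bits per byte."""
    return num_params * bits_per_weight // BITS_PER_BYTE


if __name__ == "__main__":
    size = model_size_bytes(100 * 10**12, 16)  # 100 trillion parameters at 16 bits
    print(f"{size / 10**12:.1f} TB (decimal, 10^12 bytes)")
    print(f"{size / 2**40:.1f} TiB (binary, 2^40 bytes)")
```

Dividing by 8 gives 200 TB in decimal units, or about 181.9 TiB in binary units, which would match the quoted 182 figure if it was computed in binary units.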
|
Nima Alizadeh |
Feb 12, 2026 |
Mar 13, 2026 |
|
Page 122
last 2 paragraphs |
Those paragraphs mention tuning NCCL_NSOCKS_PERTHREAD and NCCL_SOCKET_NTHREADS when using multiple NICs, but those settings are no-ops when using RDMA (IB or RoCE), which people should be using for low latency and high bandwidth, as mentioned in the previous sections.
It might be worth making this clear to avoid confusion.
Note from the Author or Editor: @Kristen: Propose to either remove this paragraph and tip:
NCCL_NSOCKS_PERTHREAD and NCCL_SOCKET_NTHREADS. As mentioned, ...
This 64 limit is NCCL’s built-in maximum...
---
or change "As mentioned, if you have" to just "If you are using RDMA with"
---
|
Anonymous |
Jan 10, 2026 |
Jan 16, 2026 |
|
Page 129
Tree and NVLSTree |
> An all-reduce is actually a reduce-scatter followed by a broadcast.
Suggested Correction: "An all-reduce in the tree algorithm is implemented as a reduce to the root followed by a broadcast."
In NCCL, the "Tree" algorithm is a pipelined Reduce-then-Broadcast pattern. The term "Reduce-Scatter" specifically refers to an operation where data is partitioned among ranks (typically used in the Ring algorithm). Using "Reduce-Scatter" here is misleading because the Tree algorithm's primary performance characteristic - lower latency but lower bandwidth utilization - stems from the fact that it does not scatter the data, but rather passes the full message through the tree hierarchy.
Note from the Author or Editor: @Kristen: Please replace that whole section with the following:
Tree and NVLSTree
In the tree algorithm, reductions and broadcasts are done in a tree structure using the spanning tree algorithm. An all-reduce in the tree algorithm is implemented as a reduce to the root followed by a broadcast. A tree can complete an all-reduce in O(log N) steps for N GPUs—as opposed to O(N) for the ring. As such, a tree algorithm provides lower latency for smaller messages. However, it may not fully utilize all links for large messages because not all GPUs transmit all the time. Some GPUs are leaves of the tree and send only one time up the tree, for instance. NCCL’s tree algorithm is optimized and often used for smaller message sizes in which the total time is dominated by the transfer-startup latency. This is known as a latency-dominated workload—in contrast to the ring algorithm’s bandwidth-dominated use case for large messages. Using NVLSTree will enable NVLink SHARP offload.
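To make the O(log N) versus O(N) comparison concrete, here is a minimal sketch that counts communication steps in an idealized model. It assumes a ring all-reduce built from a reduce-scatter plus an all-gather, and a single balanced binary tree for the reduce-then-broadcast. It ignores pipelining, chunking, and bandwidth, and it is not a description of NCCL's actual implementation; both function names are hypothetical.

```python
def ring_allreduce_steps(n_gpus):
    """Idealized ring: reduce-scatter (n-1 steps) followed by all-gather (n-1 steps)."""
    return 2 * (n_gpus - 1)


def tree_allreduce_steps(n_gpus):
    """Idealized balanced binary tree: reduce up to the root, then broadcast down."""
    height = n_gpus.bit_length() - 1  # floor(log2(n)) levels below the root
    return 2 * height


if __name__ == "__main__":
    for n in (8, 64, 1024):
        print(f"N={n}: ring={ring_allreduce_steps(n)} steps, "
              f"tree={tree_allreduce_steps(n)} steps")
```

In this model the step count grows linearly for the ring and logarithmically for the tree, which is why the tree tends to win when per-step latency dominates, as with small messages.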
|
Charles Natali |
Jan 10, 2026 |
Jan 16, 2026 |
|
Page 130
CollNet |
The whole section on CollNet seems quite confusing, not mentioning the main characteristic which is in-network reduction.
Below is an extract of "Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms":
The CollNet algorithms are intended for scenarios where the network infrastructure itself can participate in collective operations, such as using NVIDIA SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) technology, allowing reductions or other partial collective computations to be offloaded to network switches, thereby reducing data movement and latency [18].
CollNet algorithms leverage NVIDIA SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) technology for network-assisted collective operations. CollNet Direct enables all-to-all communication within the node. In contrast, CollNet Chain arranges GPUs linearly, and performs reductions up the chain and broadcasts down [19].
Note from the Author or Editor: @Kristen: Let's change to the following:
CollNet (hierarchical collectives across nodes). CollNet, also known as tree parallelism, combines two collective strategies to optimize communication at different scales. The CollNet algorithms are designed for environments where the network can actively assist with collective operations. When supported by technologies such as NVIDIA SHARP, parts of the reduction can be executed directly within the network fabric. This reduces the volume of data that must traverse the network and lowers end to end communication latency. First, it groups GPUs that share a fast local interconnect, such as all GPUs in a single node or within an NVSwitch island. CollNet then applies a high throughput algorithm such as a ring or local tree to aggregate the data within each group of GPUs. One designated leader GPU from each group participates in the second level tree reduction across groups. This minimizes the number of cross group communication rounds. By layering a local reduction on top of a global tree exchange, CollNet delivers both low latency for internode transfers and high bandwidth for intranode traffic. This makes it especially effective at reducing network load in very large, multinode GPU clusters. NCCL implements CollNet through multiple variants. CollNet Direct supports dense communication among GPUs within a node. CollNet Chain organizes GPUs into an ordered sequence, performing the reduction in one direction and distributing the result in the reverse direction.
Let's reference https://arxiv.org/pdf/2507.04786.
|
Charles Natali |
Jan 10, 2026 |
Jan 16, 2026 |
|
Page 130
CollTree |
I am not sure what "CollTree" refers to.
This is not a referenced NCCL algorithm, and it appears in neither the GitHub repository, the NCCL documentation, nor "Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms".
Note from the Author or Editor: @Kristen: please remove this bullet point. collTree(Up/Down) is an implementation detail of CollNet, not a separate NCCL algorithm, and should therefore be removed. I've updated the CollNet bullet point, as well (see other erratum).
|
Charles Natali |
Jan 10, 2026 |
Jan 16, 2026 |
|
Page 209
code block showing multidimensional inputs |
blocksPerGrid2D and threadsPerBlock2D are not defined.
As a result, there is no example of dim3.
Note from the Author or Editor: Will fix. Good catch!
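For readers wondering what such definitions usually compute, the sketch below mirrors the common ceiling-division pattern for a 2D launch configuration. It is written in Python rather than CUDA, the 16x16 block shape is an assumed example value, and the helper names are hypothetical. The book's actual definitions are not shown in this erratum.

```python
def ceil_div(a, b):
    """Integer ceiling division, so partial tiles still get a block."""
    return (a + b - 1) // b


def launch_dims_2d(width, height, block=(16, 16)):
    """Return (blocks_per_grid_2d, threads_per_block_2d) as (x, y) tuples."""
    threads_per_block_2d = block  # assumed block shape, analogous to dim3(16, 16)
    blocks_per_grid_2d = (
        ceil_div(width, block[0]),   # enough blocks to cover every column
        ceil_div(height, block[1]),  # enough blocks to cover every row
    )
    return blocks_per_grid_2d, threads_per_block_2d


if __name__ == "__main__":
    grid, block = launch_dims_2d(1000, 600)
    print(f"grid={grid}, block={block}")
```

Rounding up means some threads fall outside the input, so a kernel launched this way typically needs a bounds check on its x and y indices.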
|
Charles Natali |
Feb 15, 2026 |
Mar 13, 2026 |
|
Page 269
1st paragraph |
The explanation for why the transpose kernel has bank conflicts is backwards. It says that "all threads access tile[constant_row][varying_col]" when the problem is actually that the naive transpose kernel accesses tile[varying_row][constant_col]. Hence the explanation that follows in that paragraph is also incorrect.
Note from the Author or Editor: fixing. thanks again for catching this!
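The difference between the two access patterns can be checked with a small simulation. This is a sketch that assumes 32 shared-memory banks of 4-byte words, a 32-thread warp, and a row-major tile of 4-byte elements. The function name is hypothetical, and the padded row stride of 33 is a commonly used remedy that this erratum does not itself mention.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 banks, each one 4-byte word wide
WARP_SIZE = 32


def max_conflict_degree(row_stride, vary_row, fixed=0):
    """Largest number of lanes in one warp that map to the same bank."""
    hits = Counter()
    for lane in range(WARP_SIZE):
        # vary_row=True models tile[lane][fixed]; False models tile[fixed][lane]
        row, col = (lane, fixed) if vary_row else (fixed, lane)
        word_index = row * row_stride + col  # row-major offset in 4-byte words
        hits[word_index % NUM_BANKS] += 1
    return max(hits.values())


if __name__ == "__main__":
    print("tile[32][32], varying col:", max_conflict_degree(32, vary_row=False))
    print("tile[32][32], varying row:", max_conflict_degree(32, vary_row=True))
    print("tile[32][33], varying row:", max_conflict_degree(33, vary_row=True))
```

With a row stride of 32 words, varying the row sends every lane to the same bank, while varying the column spreads the lanes across all banks. Under this model, padding the stride to 33 removes the conflict for the varying-row pattern.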
|
Leo Xu |
Mar 17, 2026 |
|
|
Page 541
Gradient synchronization overhead |
The text states that increasing bucket_cap_mb allows operations to launch "sooner." This is technically inaccurate. In PyTorch DDP, a bucket is only synchronized once it is full. Increasing the size from 25MB to 50MB means the system must wait for more gradients to be computed, delaying the initial launch of the NCCL all-reduce. The performance benefit of larger buckets comes from improved network throughput and reduced per-message overhead, not from earlier execution.
Note from the Author or Editor: good catch! i will update for the next revision.
thanks!!
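The trade-off described in this erratum can be sketched with simple arithmetic. The model below is deliberately simplified: it assumes every bucket is filled exactly to `bucket_cap_mb` and ignores how DDP actually assigns parameter tensors to buckets. The function name and the 400 MB gradient size are hypothetical.

```python
import math


def bucket_plan(total_grad_mb, bucket_cap_mb):
    """Return (number of all-reduce launches, MB of gradients needed before the first)."""
    num_buckets = math.ceil(total_grad_mb / bucket_cap_mb)
    first_launch_after_mb = min(bucket_cap_mb, total_grad_mb)
    return num_buckets, first_launch_after_mb


if __name__ == "__main__":
    for cap in (25, 50):
        launches, wait_mb = bucket_plan(400, cap)  # hypothetical 400 MB of gradients
        print(f"bucket_cap_mb={cap}: {launches} all-reduce launches, "
              f"first launch after {wait_mb} MB of gradients")
```

In this model, doubling the bucket size halves the number of all-reduce launches but doubles the amount of gradient data that must be ready before the first one starts. That is consistent with the point above that larger buckets reduce per-message overhead rather than starting communication sooner.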
|
Charles Natali |
Feb 21, 2026 |
Mar 13, 2026 |
|
Page 598
end of chapter 13, key takeaways |
2nd subtitle => "Prefer compile mode versus versus eager mode"
delete extra "versus"
Note from the Author or Editor: fixed. thanks again for catching this!
|
O'Reilly Media |
Mar 02, 2026 |
Mar 13, 2026 |
|
Page 602
Ch 13 conclusion |
first line => "microlevel profilng" (second i of profiling is missing)
it should be microlevel profiling
also you may consider using => micro-level
and also there is macrolevel in the next line
Note from the Author or Editor: fixed! thanks for noting this.
|
O'Reilly Media |
Mar 02, 2026 |
Mar 13, 2026 |
|
Page 667
Tip at top of page |
In the second sentence "The goal is to overlap cop and compute using [...]"
The "cop" should be "copy".
Note from the Author or Editor: fixing. thanks for catching this!
|
Steven Rygaard |
Mar 20, 2026 |
|
|
Page 684
Tip |
torch.nn.attendion.sdpa_kernel()
attendion should be attention
Note from the Author or Editor: @Kristen: please update when you get a chance.
|
O'Reilly Media |
Jan 06, 2026 |
Jan 16, 2026 |
|
Page 702
Figure 16-1 |
green box title should be Decode instance rather than Prefill instance
Note from the Author or Editor: @Kristen: Please update when you get a chance.
|
O'Reilly Media |
Jan 06, 2026 |
Jan 16, 2026 |