Errata for AI Systems Performance Engineering


The errata list contains errors, and their corrections, that were found after the product was released. If an error was corrected in a later version or reprint, the date of the correction is shown in the column titled "Date corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.


Version Location Description Submitted By Date submitted Date corrected
Page 25
2nd paragraph

"In the GB200 superchip ... 480GB of LPDDR5X at up to ~500 TB/s".

The per-superchip CPU memory bandwidth is 512 GB/s, not TB/s; see the NVIDIA documentation (it seems I can't include the URL here).

Note from the Author or Editor:
Thanks for catching this. Yes, we should change this.

Charles Natali  Jan 04, 2026  Jan 16, 2026
Page 71
Paragraph 2

> Memory pinning speeds up ... because the CUDA driver doesn't have to pin pages on the fly.

This is a bit misleading, because it could imply that in the non-pinned case the runtime pins the pages on the fly.

That is not how it works: in that case the runtime uses a pool of preallocated pinned staging (bounce) buffers and performs an intermediate copy. Avoiding this intermediate copy is the main reason for the speedup.

Note from the Author or Editor:
Change sentence to:
Memory pinning speeds up the tensor.to(device) operations because it avoids the intermediate copy through the CUDA driver’s preallocated pinned staging bounce buffers that are typically used when the source is in pageable host memory.

Charles Natali  Jan 04, 2026  Jan 16, 2026
Page 79
3rd paragraph

The sentence "For instance, two training..." is repeated.

Note from the Author or Editor:
@kristen: can you fix this when you get a chance? not urgent.

Charles Natali  Jan 04, 2026  Jan 16, 2026
Pages 379 and 380, Chapter 10
second-to-last paragraph of Page 379, first paragraph of Page 380

These two paragraphs mention buffers `curr` and `next`, but no such names appear in the preceding code. If I understand correctly, the code uses a variable `s = tile % STAGES` to select which of the two buffers to use.
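The indexing described above can be sketched in a few lines of Python. This is only an illustration of the modulo scheme, not the book's kernel: `STAGES` and `tile` come from the erratum, while `stage_indices`, the `curr`/`nxt` naming, and the `(tile + 1) % STAGES` expression for the prefetch buffer are assumptions made for this sketch.

```python
STAGES = 2  # two buffers, i.e. double buffering (value assumed from the erratum's "two buffers")

def stage_indices(tile, stages=STAGES):
    """Return (curr, nxt) buffer indices for a given tile number."""
    curr = tile % stages        # buffer used for this tile (the erratum's `s`)
    nxt = (tile + 1) % stages   # buffer that would hold the following tile (assumption)
    return curr, nxt

# Buffer indices alternate as the tile counter advances.
schedule = [stage_indices(tile) for tile in range(4)]
print(schedule)
```

With two stages, the two indices simply swap on every tile, which is presumably what the prose meant by `curr` and `next`.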

Note from the Author or Editor:
confirmed. fixing.

Liang He  Feb 07, 2026  Mar 13, 2026
Pages 387 and 388
the code snippet

I haven't tested the warp_specialized_pipeline_kernel code snippet thoroughly (only a basic test), but I think there may be a couple of issues.
First, if cuda::make_pipeline is called without specifying a producer count, then as I understand it pipe.producer_acquire() and pipe.consumer_wait() block until all threads in the current block reach that point. Because execution takes different branches based on warp ID, the program stalls indefinitely. (This is what happens on my consumer PC with an RTX 4070, sm_89.)
Second, there are two pipe.consumer_wait() calls (for warps 1 and 2). As I understand it, nothing guarantees that warp 1 always consumes the output of warp 0 and that warp 2 consumes the output of warp 1, unless __syncthreads() is inserted between the `if`s.
I may be misunderstanding something important, but I do find this confusing.

Note from the Author or Editor:
Confirmed. I will update the github repo - and I will update the book, as well. We will patch this soon. Thanks for catching this! @Kristen: I have the updates ready for you.

Liang He  Mar 01, 2026  Mar 13, 2026
Page 12
4

"For example, a 100-trillion-parameter model would require approximately 182 TB of
GPU memory (182 TB = 100 trillion parameters × 16 bits per weight × 8 bits per
byte) to load the model if each parameter is stored in 16-bit (2-byte) precision"

The error is in the formula: It should be 100 trillion parameters × 16 bits per weight ÷ 8 bits per byte (to convert bits to bytes), not × 8 bits per byte.
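As a quick check of the corrected formula, the arithmetic can be worked through in Python. The variable names are illustrative. The observation that the quoted 182 figure lines up with binary terabytes (TiB) rather than decimal terabytes is an inference from the numbers, not something stated in the book.

```python
params = 100 * 10**12      # 100 trillion parameters
bits_per_weight = 16       # 16-bit (2-byte) precision
bits_per_byte = 8

# Corrected formula: divide by 8 to convert bits to bytes.
total_bytes = params * bits_per_weight // bits_per_byte

decimal_tb = total_bytes / 10**12   # terabytes (10^12 bytes)
binary_tib = total_bytes / 2**40    # tebibytes (2^40 bytes)

# The formula as printed multiplies by 8 instead, which is 64x too large.
printed_formula = params * bits_per_weight * bits_per_byte

print(f"{decimal_tb:.1f} TB, {binary_tib:.1f} TiB")
```

The division gives 200 TB in decimal units, or roughly 182 TiB, so the 182 figure is consistent with the corrected formula once binary units are assumed.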

Note from the Author or Editor:
Confirmed. Good catch!

Nima Alizadeh  Feb 12, 2026  Mar 13, 2026
Page 122
last 2 paragraphs

These paragraphs discuss tuning NCCL_NSOCKS_PERTHREAD and NCCL_SOCKET_NTHREADS when using multiple NICs, but those settings are no-ops when using RDMA (IB or RoCE), which is what people should be using for low latency and high bandwidth, as mentioned in the previous sections.
It would help to make this explicit to avoid confusion.

Note from the Author or Editor:
@Kristen: Propose to either remove this paragraph and tip:

NCCL_NSOCKS_PERTHREAD and NCCL_SOCKET_NTHREADS. As mentioned, ...

This 64 limit is NCCL’s built-in maximum...

---

or change "As mentioned, if you have" to just "If you are using RDMA with"

---

Anonymous  Jan 10, 2026  Jan 16, 2026
Page 129
Tree and NVLSTree

> An all-reduce is actually a reduce-scatter followed by a broadcast.

Suggested Correction: "An all-reduce in the tree algorithm is implemented as a reduce to the root followed by a broadcast."

In NCCL, the "Tree" algorithm is a pipelined Reduce-then-Broadcast pattern. The term "Reduce-Scatter" specifically refers to an operation where data is partitioned among ranks (typically used in the Ring algorithm). Using "Reduce-Scatter" here is misleading because the Tree algorithm's primary performance characteristic - lower latency but lower bandwidth utilization - stems from the fact that it does not scatter the data, but rather passes the full message through the tree hierarchy.

Note from the Author or Editor:
@Kristen: Please replace that whole section with the following:

Tree and NVLSTree
In the tree algorithm, reductions and broadcasts are done in a tree structure using the spanning tree algorithm. An all-reduce in the tree algorithm is implemented as a reduce to the root followed by a broadcast. A tree can complete an all-reduce in O(log N) steps for N GPUs—as opposed to O(N) for the ring. As such, a tree algorithm provides lower latency for smaller messages. However, it may not fully utilize all links for large messages because not all GPUs transmit all the time. Some GPUs are leaves of the tree and send only one time up the tree, for instance. NCCL’s tree algorithm is optimized and often used for smaller message sizes in which the total time is dominated by the transfer-startup latency. This is known as a latency-dominated workload—in contrast to the ring algorithm’s bandwidth-dominated use case for large messages. Using NVLSTree will enable NVLink SHARP offload.
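The O(log N) versus O(N) comparison can be made concrete with a simplified step-count model. This sketch is an assumption-laden illustration, not NCCL's implementation: it counts a ring all-reduce as a reduce-scatter plus an all-gather of N-1 steps each, and a tree all-reduce as one pass up and one pass down a balanced binary tree. The function names are hypothetical, and the model ignores pipelining and message size.

```python
def ring_allreduce_steps(n):
    """Reduce-scatter (n-1 steps) followed by all-gather (n-1 steps)."""
    return 2 * (n - 1)

def tree_allreduce_steps(n):
    """Reduce up, then broadcast down, a balanced binary tree."""
    depth = (n - 1).bit_length()   # integer ceil(log2(n)) for n >= 1
    return 2 * depth

for n in (8, 64, 512):
    print(n, ring_allreduce_steps(n), tree_allreduce_steps(n))
```

Under this model the ring's step count grows linearly with the number of GPUs while the tree's grows logarithmically, which is why the tree tends to win when per-step latency dominates.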

Charles Natali  Jan 10, 2026  Jan 16, 2026
Page 130
CollNet

The whole section on CollNet is quite confusing and does not mention the algorithm's main characteristic, which is in-network reduction.

Below is an extract of "Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms":

The CollNet algorithms are intended for scenarios where the network infrastructure itself can participate in collective operations, such as using NVIDIA SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) technology, allowing reductions or other partial collective computations to be offloaded to network switches, thereby reducing data movement and latency [18].

CollNet algorithms leverage NVIDIA SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) technology for network-assisted collective operations. CollNet Direct enables all-to-all communication within the node. In contrast, CollNet Chain arranges GPUs linearly, and performs reductions up the chain and broadcasts down [19].

Note from the Author or Editor:
@Kristen: Let's change to the following:

CollNet (hierarchical collectives across nodes). CollNet, also known as tree parallelism, combines two collective strategies to optimize communication at different scales. The CollNet algorithms are designed for environments where the network can actively assist with collective operations. When supported by technologies such as NVIDIA SHARP, parts of the reduction can be executed directly within the network fabric. This reduces the volume of data that must traverse the network and lowers end to end communication latency. First, it groups GPUs that share a fast local interconnect, such as all GPUs in a single node or within an NVSwitch island. CollNet then applies a high throughput algorithm such as a ring or local tree to aggregate the data within each group of GPUs. One designated leader GPU from each group participates in the second level tree reduction across groups. This minimizes the number of cross group communication rounds. By layering a local reduction on top of a global tree exchange, CollNet delivers both low latency for internode transfers and high bandwidth for intranode traffic. This makes it especially effective at reducing network load in very large, multinode GPU clusters. NCCL implements CollNet through multiple variants. CollNet Direct supports dense communication among GPUs within a node. CollNet Chain organizes GPUs into an ordered sequence, performing the reduction in one direction and distributing the result in the reverse direction.

Let's reference https://arxiv.org/pdf/2507.04786.

Charles Natali  Jan 10, 2026  Jan 16, 2026
Page 130
CollTree

I am not sure what "CollTree" refers to.
This is not a referenced NCCL algorithm, and appears neither in the GitHub repository, NCCL documentation nor "Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms".

Note from the Author or Editor:
@Kristen: please remove this bullet point. collTree(Up/Down) is an implementation detail of CollNet, not a separate NCCL algorithm, and should therefore be removed. I've updated the CollNet bullet point, as well (see other erratum).

Charles Natali  Jan 10, 2026  Jan 16, 2026
Page 209
code block showing multidimensional inputs

blocksPerGrid2D and threadsPerBlock2D are never defined, so the code block contains no actual example of dim3.
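For reference, the values such definitions would typically hold can be sketched as follows. This is a Python illustration of the usual ceiling-division sizing, not the book's code: the 16×16 block shape and the 1000×600 input size are assumptions chosen for the example, and only the two variable names are taken from the erratum.

```python
def ceil_div(a, b):
    """Smallest integer >= a / b, for positive integers."""
    return (a + b - 1) // b

width, height = 1000, 600          # hypothetical 2D input size
threads_per_block_2d = (16, 16)    # (x, y) threads per block, assumed shape

# Enough blocks in each dimension to cover the whole input.
blocks_per_grid_2d = (
    ceil_div(width, threads_per_block_2d[0]),
    ceil_div(height, threads_per_block_2d[1]),
)
print(blocks_per_grid_2d)
```

Because the grid is rounded up, a kernel launched with these dimensions would still need a bounds check so that threads past the edge of the input do no work.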

Note from the Author or Editor:
Will fix. Good catch!

Charles Natali  Feb 15, 2026  Mar 13, 2026
Page 269
1st paragraph

The explanation for why the transpose kernel has bank conflict is backwards. It says that "all threads access tile[constant_row][varying_col]" when the problem is actually because the naive transpose kernel accesses tile[varying_row][constant_col]. Hence the explanation afterward in that paragraph is also incorrect.
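The difference between the two access patterns can be checked with a small bank-index calculation. This sketch assumes 32 shared-memory banks of 4-byte words, a 32×32 tile of 4-byte elements, and one warp of 32 threads. The function name and the 33-element padded row are illustrative and not taken from the book.

```python
NUM_BANKS = 32   # shared-memory banks, one 4-byte word wide each
WARP_SIZE = 32

def banks_touched(row_stride, varying):
    """Banks hit when a warp reads tile[row][col] with one index varying per lane."""
    banks = []
    for lane in range(WARP_SIZE):
        row, col = (lane, 0) if varying == "row" else (0, lane)
        word_index = row * row_stride + col   # linear offset in 4-byte words
        banks.append(word_index % NUM_BANKS)
    return banks

row_major = len(set(banks_touched(32, "col")))   # tile[constant_row][varying_col]
col_major = len(set(banks_touched(32, "row")))   # tile[varying_row][constant_col]
padded = len(set(banks_touched(33, "row")))      # same access with rows padded to 33
print(row_major, col_major, padded)
```

With a row stride of 32 words, varying the row sends every lane to the same bank, while varying the column spreads the lanes across all 32 banks. Padding each row by one word restores distinct banks for the column-wise access.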

Note from the Author or Editor:
fixing. thanks again for catching this!

Leo Xu  Mar 17, 2026 
Page 541
Gradient synchronization overhead

The text states that increasing bucket_cap_mb allows operations to launch "sooner." This is technically inaccurate. In PyTorch DDP, a bucket is only synchronized once it is full. Increasing the size from 25MB to 50MB means the system must wait for more gradients to be computed, delaying the initial launch of the NCCL all-reduce. The performance benefit of larger buckets comes from improved network throughput and reduced per-message overhead, not from earlier execution.
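The timing argument can be illustrated with a deliberately simplified model in which a bucket launches as soon as its accumulated gradients reach the cap. This is not PyTorch DDP's implementation: the function name, the 20 equally sized 5 MB gradients, and the fill-then-launch rule are all assumptions made for the sketch.

```python
def bucket_launches(grad_sizes_mb, bucket_cap_mb):
    """Return the index of the gradient after which each bucket would launch."""
    launches = []
    filled = 0.0
    for i, size in enumerate(grad_sizes_mb):
        filled += size
        if filled >= bucket_cap_mb:
            launches.append(i)   # bucket is full: all-reduce can start
            filled = 0.0
    if filled > 0:
        launches.append(len(grad_sizes_mb) - 1)   # final partial bucket
    return launches

grads = [5.0] * 20   # hypothetical: 20 gradients of 5 MB, in the order they become ready
small = bucket_launches(grads, 25)
large = bucket_launches(grads, 50)
print(small, large)
```

In this model the 50 MB cap delays the first launch from the fifth gradient to the tenth and halves the number of all-reduce calls, matching the point that larger buckets trade later launches for fewer, larger messages.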

Note from the Author or Editor:
good catch! i will update for the next revision.

thanks!!

Charles Natali  Feb 21, 2026  Mar 13, 2026
Page 598
end of chapter 13, key takeaways

2nd subtitle => "Prefer compile mode versus versus eager mode"

delete extra "versus"

Note from the Author or Editor:
fixed. thanks again for catching this!

O'Reilly Media  Mar 02, 2026  Mar 13, 2026
Page 602
Ch 13 conclusion

first line => "microlevel profilng" (second i of profiling is missing)

it should be microlevel profiling

You may also consider hyphenating it as "micro-level"; the same applies to "macrolevel" on the next line.

Note from the Author or Editor:
fixed! thanks for noting this.

O'Reilly Media  Mar 02, 2026  Mar 13, 2026
Page 667
Tip at top of page

In the second sentence "The goal is to overlap cop and compute using [...]"

The "cop" should be "copy".

Note from the Author or Editor:
Fixing. Thanks for catching this!

Steven Rygaard  Mar 20, 2026 
Page 684
Tip

torch.nn.attendion.sdpa_kernel()

attendion should be attention

Note from the Author or Editor:
@Kristen: please update when you get a chance.

O'Reilly Media  Jan 06, 2026  Jan 16, 2026
Page 702
Figure 16-1

The title of the green box should be "Decode instance" rather than "Prefill instance".

Note from the Author or Editor:
@Kristen: Please update when you get a chance.

O'Reilly Media  Jan 06, 2026  Jan 16, 2026