Errata for Fundamentals of Data Engineering

The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction is displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.

Version Location Description Submitted By Date submitted Date corrected
Printed
Page Acknowledgments
technical reviewers paragraph

Tod Hanseman should be spelled Tod Hansmann

Joe Reis  Jul 22, 2022  Jul 28, 2023
Page 104, section "Lambda Architecture"
Figure 3-14

Figure 3-14, which illustrates the Lambda architecture, does not match the description in the paragraph above it.
The author says: "In a Lambda architecture (Figure 3-14), you have systems operating independently of each other—batch, streaming, and serving."
In the figure, we have two streaming systems (the batch system is not shown) and a serving system.

Note from the Author or Editor:
The bottom box that says "stream processing" that attaches to "batch processing" should say "batch processing".

Anonymous  Nov 15, 2022  Jul 28, 2023
Page Acknowledgments - page xix
Upper section

Lior Gavish is mentioned twice

Note from the Author or Editor:
Please remove the second reference to Lior Gavish in the acknowledgments.

Igal drayerman  Feb 10, 2023  Jul 28, 2023
Printed
Page 241, in the Frequency subsection
middle image

Figure 7-4 shows ingestion frequencies of data in batch, micro-batch, and real time. The subheadings "frequent" and "semi-frequent" are in the wrong order.

It should be:

batch = semi-frequent
micro-batch = frequent

Joe Reis  Jul 31, 2023  Mar 20, 2026
Page 273
Section "Data Definition Language", 2nd paragraph

On page 273, in the Data Definition Language section, the second paragraph classifies "UPDATE" as a DDL expression. However, "UPDATE" is typically considered a DML expression.

Note from the Author or Editor:
Thanks for spotting this error.

Divyansh Jain  Nov 16, 2023  Mar 20, 2026
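
To make the distinction raised in the page 273 erratum concrete, here is a minimal sketch using Python's built-in sqlite3 module. The products table and its columns are hypothetical and do not come from the book. The point is only that CREATE TABLE defines a schema object (DDL), while INSERT and UPDATE change rows inside an existing table (DML).

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")

# DDL: defines a schema object (hypothetical table for illustration).
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")

# DML: adds and modifies rows; the schema itself is untouched.
conn.execute("INSERT INTO products (name, price) VALUES (?, ?)", ("widget", 9.99))
conn.execute("UPDATE products SET price = ? WHERE name = ?", (12.5, "widget"))

# Read the row back to confirm the UPDATE changed data, not structure.
row = conn.execute("SELECT name, price FROM products").fetchone()
print(row)
```

UPDATE only touches the stored values here; changing the table's structure would take a DDL statement such as ALTER TABLE.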
Page 307
l. -3.

The line
"That's it! Now let’s look at ways to view data contextually using satellites."
does not seem to fit in this place.

The line should be just below Table 8-18, and above the "Satellites" paragraph.

Note from the Author or Editor:
This might read better if we move the "That's it! Now let’s look at ways to view data contextually using satellites." sentence to the end of the Link section, after the sentence that says "Note that we're...". This is the sentence right before the satellite portion begins.

HIDEMOTO NAKADA  Jan 14, 2024  Mar 20, 2026
Page 10
Figure 1-3

Figure 1-3 is quite blurry in both the print and Kindle editions of the book

Note from the Author or Editor:
On the one hand, this is somewhat deliberate to emphasize the sheer number of tools in each diagram. On the other hand, it would be nice to at least make the left image with fewer tools more clear. There are higher resolution images available on Matt Turck's website, and I could also email Matt to request original files.

Sergio Ramos-Valverde  Nov 12, 2024  Mar 20, 2026
Page 49
Figure 2-7

In Figure 2-7, under DataOps, the first item should be Automation, not Data Governance.
This would bring Figure 2-7 in line with the items in Figure 2-8.

Note from the Author or Editor:
Replace "Data Governance" with "Automation" in the diagram.

Sky Quintin  May 08, 2024  Mar 20, 2026
Page 103
Last paragraph

The acronym "HDFS" is used without definition. Is it possible to include what it stands for?

Note from the Author or Editor:
We define HDFS the first time we use the acronym. I don't want to be too repetitive by spelling this out every time. One possible solution is to put HDFS in the index. (Right now, we have an entry for "Hadoop Distributed File System," but not for HDFS.)

Sergio Ramos-Valverde  Nov 27, 2024  Mar 20, 2026
Page 168
2nd paragraph

"...which we discuss at greater length in 'Messages and Streams' on page 167.)" should probably read, "...which we discuss at greater length in 'Message Queues and Event-Streaming Platforms' on page 259.)". In its current form, this is a self-referential breadcrumb, and the preceding paragraphs in the section do not "discuss at greater length," whereas the aforementioned section starting on page 259 does go into more detail. This is a particularly confusing typo due to the section name. Indeed, I did not understand the parenthetical until 100 pages later!

Note from the Author or Editor:
O'Reilly - can we fix this? Thanks.

Adam Shamlian  Nov 16, 2022  Jul 28, 2023
Page 170
Paragraph 4, "Lookups"

"Understand how to leverage for efficient extraction." Feels like it should read "Understand how to leverage them for efficient extraction." or "Understand how to leverage indexes for efficient extraction."

Note from the Author or Editor:
Fixed in Atlas branch mlhousley

Sergio Ramos-Valverde  Dec 05, 2024  Mar 20, 2026
Page 174
Bottom of page

The JSON object printed at the bottom of this page is not formatted properly for some of the nested data. It makes reading and interpreting what this data represents quite difficult.

The two lines after the line starting with "name" ("first" and "last") should have two additional leading spaces. The same applies to the four lines after "favorite_bands".

Joe Reis, co-author, sent me to this link after we discussed this on LinkedIn. I would be more than happy to volunteer to help fix the formatting.

Note from the Author or Editor:
O'Reilly - can we better format this? Thanks.

Brian Armstrong  Dec 12, 2022  Jul 28, 2023
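
The indentation described in the page 174 erratum can be sketched with Python's standard json module. The keys "name", "first", "last", and "favorite_bands" come from the erratum; all values below are placeholders, since the book's actual data is not reproduced here.

```python
import json

# Placeholder values; only the key names are taken from the erratum.
record = {
    "name": {"first": "Ada", "last": "Example"},
    "favorite_bands": ["Band A", "Band B", "Band C", "Band D"],
}

# indent=2 adds two spaces per nesting level, so the children of
# "name" and "favorite_bands" sit two spaces deeper than their parent.
formatted = json.dumps(record, indent=2)
print(formatted)
```

With this setting, "first" and "last" appear with four leading spaces under "name", which has two, matching the extra two spaces the erratum asks for.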
Page 182
3rd paragraph

"from routing messages between microservices ingesting millions of events per second of event data from web, mobile, and IoT applications." This feels like there should be a "to" somewhere in this sentence.

Note from the Author or Editor:
Fixed in Atlas on branch mlhousley.

Sergio Ramos-Valverde  Dec 09, 2024  Mar 20, 2026
Page 184
3rd paragraph on the page; 1st paragraph in subsection "Topics."

The last sentence of the paragraph reads, "A topic can have zero, one, or multiple producers and customers on most event-streaming platforms."

It should probably read, "A topic can have zero, one, or multiple producers and consumers on most event-streaming platforms."

So, "consumers", not "customers".

Note from the Author or Editor:
Confirmed. Please correct.

L. D. Nicolas May  Aug 27, 2023  Mar 20, 2026
Page 188
second to last paragraph

"...-should extend up and down the entire stack and support the data engineering and lifecycle."

Not sure the last "and" is supposed to be there.

Note from the Author or Editor:
Delete the last "and".

Sergio Ramos-Valverde  Dec 09, 2024  Mar 20, 2026
Page 219
First paragraph

In the first paragraph there is a reference to a figure that reads (see Figure 6-3).

It should reference Figure 6-2.

Note from the Author or Editor:
Confirmed

Mike Porter  Sep 03, 2023  Mar 20, 2026
Page 229
1st paragraph

"The storage price goes down from faster/higher performing storage to lower storage"

Shouldn't this be:

"The storage price goes down from faster/higher performing storage to slower storage"

Sergio Ramos-Valverde  Dec 17, 2024  Mar 20, 2026
Page 234
"Software Engineering" paragraph

"Make sure the code you write stores the data correctly and doesn't accidentally cause data, memory leaks, or performance issues."

What is meant by "...cause data"?

Note from the Author or Editor:
Edit: "Make sure that your code stores the data correctly and doesn't accidentally cause memory leaks or performance issues."

Sergio Ramos-Valverde  Dec 17, 2024  Mar 20, 2026
Page 243
Paragraph 2

"The big idea is that rather than relying on asynchronous processing, where a batch process runs for each stage as the input batch closes and certain time conditions are met, each stage of the asynchronous pipeline can process data items as they become available in parallel across the Beam cluster"

Shouldn't the first "asynchronous" be "synchronous"?

Sergio Ramos-Valverde  Dec 18, 2024  Mar 20, 2026
Printed
Page 245
"Semistructured JSON" under "Shape"

"The key-value pairs and nesting depth occur with subelements"

Is the word "occur" needed? Unsure of the meaning with it in there.

Note from the Author or Editor:
"The key-value pairs and nesting structure"

Sergio Ramos-Valverde  Dec 19, 2024  Mar 20, 2026
Printed
Page 246
Paragraph 4

"Schema is not only for databases. As we've discussed, present their schema complications"

Shouldn't this read as follows?

"Schema is not only for databases. As we've discussed, present their own schema complications"

Note from the Author or Editor:
Text should read "As we've discussed, APIs present..."

Sergio Ramos-Valverde  Dec 19, 2024  Mar 20, 2026
Printed
Page 249
Paragraph 2

"Some size-based ingestion systems can break data into objects based on various criteria, such as the size in bytes of the total number of events."

Shouldn't this be

"Some size-based ingestion systems can break data into objects based on various criteria, such as the size in bytes or the total number of events."

Sergio Ramos-Valverde  Dec 19, 2024  Mar 20, 2026
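
The corrected sentence in the page 249 erratum names two splitting criteria: size in bytes or total number of events. As a rough sketch of what that could look like, the function below closes a batch when either limit would be exceeded. The function name, parameters, and sample events are hypothetical and are not taken from the book or from any particular ingestion system.

```python
def chunk_events(events, max_bytes=None, max_events=None):
    """Group string events into batches, closing a batch when adding the
    next event would exceed max_bytes or when max_events is reached."""
    batches, current, current_bytes = [], [], 0
    for event in events:
        size = len(event.encode("utf-8"))
        over_bytes = (
            max_bytes is not None
            and bool(current)
            and current_bytes + size > max_bytes
        )
        over_count = max_events is not None and len(current) >= max_events
        if over_bytes or over_count:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(event)
        current_bytes += size
    if current:
        batches.append(current)  # flush the final partial batch
    return batches


events = ["aaaa", "bbbb", "cccc", "dddd", "eeee"]  # 4 bytes each
by_count = chunk_events(events, max_events=2)
by_bytes = chunk_events(events, max_bytes=12)
print(by_count)
print(by_bytes)
```

An event larger than max_bytes still gets its own batch in this sketch rather than being rejected; a real system would need an explicit policy for that case.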
Printed
Page 253
Paragraph 3

"Find the right balance of TTL impact on our data pipeline."

Shouldn't it read

"Find the right balance of TTL impact on your data pipeline."

Sergio Ramos-Valverde  Dec 19, 2024  Mar 20, 2026
Printed
Page 266
Paragraph 2

"In that case, your data sizes are presumably batching or streaming much smaller data sizes on an ongoing basis."

Should this be re-written? Feels awkward. Maybe:

"In that case, you are presumably batching or streaming much smaller data sizes on an ongoing basis."

or

"In that case, your data sizes are presumably at a batching or streaming scale on an ongoing basis."

Note from the Author or Editor:
Yes, the first suggestion is correct, thank you!

Sergio Ramos-Valverde  Dec 24, 2024  Mar 20, 2026
Page 267
Paragraph 2

"Involving product managers in the outcome and treating downstream data processed as part of a product encourages them to allocate scarce software development to collaboration with data engineers."

Shouldn't this be "software development time" or "software development resources"?

Note from the Author or Editor:
Change from "scarce software development" to "software development resources".

Sergio Ramos-Valverde  Dec 24, 2024  Mar 20, 2026
Printed
Page 268
Paragraph 1

"The last thing you want is to capture or compromise the data while moving."

Shouldn't this be:

"The last thing you want is for the data to be captured or compromised while moving."

Sergio Ramos-Valverde  Dec 24, 2024  Mar 20, 2026
Printed
Page 269
Paragraph 5

"This means that engineers develop and test code on simulated or cleansed data in development and staging environments but automated code deployments to production."

Shouldn't this be "automate" instead of "automated"?

Sergio Ramos-Valverde  Dec 24, 2024  Mar 20, 2026
Page 272
Paragraph 3

"An orchestration can start each ingestion task at the appropriate scheduled time..."

Unsure if this is actually a typo or not but feels like it should read:

"An orchestrator can start each ingestion task at the appropriate scheduled time..."

Sergio Ramos-Valverde  Dec 24, 2024  Mar 20, 2026
Printed
Page 287
Not sure, got feedback from someone

On page 287, "if new events arrive for the use" should read "if new events arrive for the user".

Joe Reis
 
Feb 11, 2023  Jul 28, 2023
Page 293
Paragraph 2

"Along this continuum (Figure 8-12), three main data models are conceptual, logical, and physical."

Shouldn't that be a "the" or "the three"?

Sergio Ramos-Valverde  Dec 27, 2024  Mar 20, 2026
Printed
Page 306, 307, 308
Data Vault section

Bundling a few together from this section.

1. "Using our ecommerce scenario, let's look at an example of a hub for products. First, let's look at the physical design of a product hub."

Either these are two redundant sentences, or this should read more like:

"Using our ecommerce scenario, let's look at an example of a data vault model for products. First, let's look at the physical design of a product hub."

2. All unpopulated examples show each column of a data vault table as a row, and then when populated, they are all columns. Table names are referenced in empty table examples but not populated ones.

I think examples similar to the normalization section table examples might make more sense. So, getting rid of empty table examples and changing table text more like: "Table 8-15. ProductHub"

But that's just one idea and this suggestion may be nitpicky on my part.

3. At the end of the Links paragraph on page 307, you say "That's it! Now let's look at ways to view data contextually using satellites."

However, the very next paragraph you go into an example of a link table.

Note from the Author or Editor:
1. "First, let's look at the physical design."

2. We'll have to leave this as-is for now, but will keep it in mind for the future.

3. Fixed per another errata item

Sergio Ramos-Valverde  Dec 30, 2024  Mar 20, 2026
Page 318
Paragraph 1

"However, these aren't committed to a Git repository without an external CI/CD system to manage deployment."

Shouldn't this be:

"However, these aren't committed to a Git repository with an external CI/CD system to manage deployment."

Note from the Author or Editor:
Revised for clarity:
"However, UDFs aren’t automatically committed to a Git repository unless you've set up CI/CD processes to manage deployment."

Sergio Ramos-Valverde  Dec 31, 2024  Mar 20, 2026
Printed
Page 326
second to last paragraph

"A complex query that filters with a WHERE clause joins three tables and applies a window function that would consist of many map and reduce stages."

Shouldn't this read:

"A complex query that filters with a WHERE clause, joins three tables and applies a window function would consist of many map and reduce stages."

Note from the Author or Editor:
Yes, thank you!

Sergio Ramos-Valverde  Jan 01, 2025  Mar 20, 2026
Printed
Page 331
Last paragraph

"Let's take a simple example where streaming DAG would be useful."

Shouldn't this be: "Let's take a simple example where a streaming DAG would be useful."

Sergio Ramos-Valverde  Jan 01, 2025  Mar 20, 2026
Printed
Page 334
second to last paragraph

"If someone does have access to a dataset, continue to control who has access to a dataset's column, row, and cell-level access."

Not sure what is meant by this sentence but it feels a bit clunky. Possible alternatives based on my best guess of the intended meaning:

1. "If someone does have access to a dataset, you can further refine that access at the column, row, and cell level."

2. "If someone does have access to a dataset, you can further refine that access at the column, row, and cell level, according to the principle of least privilege."

3. "Access can be granted broadly for an entire dataset, or narrowed to specific columns, rows, or cells on a case by case basis."

Note from the Author or Editor:
Let's use your first suggestion, but with "cell" changed to "field"

Sergio Ramos-Valverde  Jan 02, 2025  Mar 20, 2026
Printed
Page 343
Paragraph 2

"Data validation is analyzing data to ensure that it accurately represents financial information, customer interactions, and sales."

This feels like a very narrow definition.

What if you work for risk and compliance teams, or internal IT, none of which deal directly in customer or financial data? "Customer" could be meant in a more general sense to encompass these cases, but I feel it's misleading.

Also, aren't "sales" a type of "financial information"?

In my opinion this could be expanded to something more generic like:

"Data validation is analyzing data to ensure that it accurately represents information from its "source of truth" (e.g. source systems or upstream datasets), and faithfully applies any agreed upon business logic.

This may involve checks such as walking a stakeholder through a manual reproduction of the export and transformation steps on sample data to confirm the end product reconciles with its source."

Note from the Author or Editor:
Change to:
Data validation is analyzing data to ensure that it accurately represents financial information, customer interactions, sales, etc.

Sergio Ramos-Valverde  Jan 03, 2025  Mar 20, 2026
Printed
Page 360
second to last paragraph

"...- the latter two are popular for data science applications, especially notebooks."

Is this supposed to be:

"...- the latter two are popular for data science applications, especially in notebooks."

Sergio Ramos-Valverde  Jan 06, 2025  Mar 20, 2026
Printed
Page 363
Last paragraph

Unsure what is meant by "roll" in:

"While you can roll your reverse ETL solution, many off-the-shelf reverse ETL options are available."

Should that have been "roll out" or "build your own"?

Note from the Author or Editor:
Should be "roll your own"

Sergio Ramos-Valverde  Jan 06, 2025  Mar 20, 2026
Printed
Page 375
Paragraph 5

"We're amazed at the number of companies with security policies in the hundreds of pages that nobody reads, the annual security policy review that people immediately forget, all in checking a box for a security audit."

I feel like this should be:

"We're amazed at the number of companies with security policies in the hundreds of pages that nobody reads, and annual security policy reviews that people immediately forget, all in the name of checking a box for a security audit."

or

"We're amazed at the number of companies with security policies in the hundreds of pages that nobody reads, and annual security policy reviews that people immediately forget, all in the name of checking a box or fear of a security audit."

Note from the Author or Editor:
First suggestion is good, thanks!

Sergio Ramos-Valverde  Jan 08, 2025  Mar 20, 2026
Printed
Page 376
Paragraph 3

"...give them only the privileges and data they need to do their jobs, and only for the timespan when needed."

Should be

"...give them only the privileges and data they need to do their jobs, and only for the timespan they're needed."

or

"...give them only the privileges and data they need to do their jobs when needed, and only for the timespan they're needed."

Note from the Author or Editor:
"give them only the privileges and data they need to do their jobs, and only for the necessary timespan"

Sergio Ramos-Valverde  Jan 08, 2025  Mar 20, 2026
Printed
Page 399
Paragraph 2

"Arrow has already spanned a new data warehouse product;"

Shouldn't this be

"Arrow has already spawned a new data warehouse product;"

Sergio Ramos-Valverde  Jan 10, 2025  Mar 20, 2026
Printed
Page 406
Paragraph 2

"Cloud providers offer CDN options and many other providers, such as Cloudflare."

Was this meant to be:

"Cloud providers and many other providers, such as Cloudflare, offer CDN options."

Sergio Ramos-Valverde  Jan 10, 2025  Mar 20, 2026