Book description
Take advantage of today's sky-high demand for data engineers. With this in-depth book, current and aspiring engineers will learn powerful real-world best practices for managing data big and small. Contributors from notable companies including Twitter, Google, Stitch Fix, Microsoft, Capital One, and LinkedIn share their experiences and lessons learned for overcoming a variety of specific and often nagging challenges.
Edited by Tobias Macey, host of the popular Data Engineering Podcast, this book presents 97 concise and useful tips for cleaning, prepping, wrangling, storing, processing, and ingesting data. Data engineers, data architects, data team managers, data scientists, machine learning engineers, and software engineers will greatly benefit from the wisdom and experience of their peers.
Topics include:
- The Importance of Data Lineage - Julien Le Dem
- Data Security for Data Engineers - Katharine Jarmul
- The Two Types of Data Engineering and Data Engineers - Jesse Anderson
- Six Dimensions for Picking an Analytical Data Warehouse - Gleb Mezhanskiy
- The End of ETL as We Know It - Paul Singman
- Building a Career as a Data Engineer - Vijay Kiran
- Modern Metadata for the Modern Data Stack - Prukalpa Sankar
- Your Data Tests Failed! Now What? - Sam Bail
Table of contents
- Preface
- 1. A (Book) Case for Eventual Consistency
- 2. A/B and How to Be
- 3. About the Storage Layer
- 4. Analytics as the Secret Glue for Microservice Architectures
- 5. Automate Your Infrastructure
- 6. Automate Your Pipeline Tests
- 7. Be Intentional About the Batching Model in Your Data Pipelines
- 8. Beware of Silver-Bullet Syndrome
- 9. Building a Career as a Data Engineer
- 10. Business Dashboards for Data Pipelines
- 11. Caution: Data Science Projects Can Turn into the Emperor’s New Clothes
- 12. Change Data Capture
- 13. Column Names as Contracts
- 14. Consensual, Privacy-Aware Data Collection
- 15. Cultivate Good Working Relationships with Data Consumers
- 16. Data Engineering != Spark
- 17. Data Engineering for Autonomy and Rapid Innovation
- 18. Data Engineering from a Data Scientist’s Perspective
- 19. Data Pipeline Design Patterns for Reusability and Extensibility
- 20. Data Quality for Data Engineers
- 21. Data Security for Data Engineers
- 22. Data Validation Is More Than Summary Statistics
- 23. Data Warehouses Are the Past, Present, and Future
- 24. Defining and Managing Messages in Log-Centric Architectures
- 25. Demystify the Source and Illuminate the Data Pipeline
- 26. Develop Communities, Not Just Code
- 27. Effective Data Engineering in the Cloud World
- 28. Embrace the Data Lake Architecture
- 29. Embracing Data Silos
- 30. Engineering Reproducible Data Science Projects
- 31. Five Best Practices for Stable Data Processing
- 32. Focus on Maintainability and Break Up Those ETL Tasks
- 33. Friends Don’t Let Friends Do Dual-Writes
- 34. Fundamental Knowledge
- 35. Getting the “Structured” Back into SQL
- 36. Give Data Products a Frontend with Latent Documentation
- 37. How Data Pipelines Evolve
- 38. How to Build Your Data Platform like a Product
- 39. How to Prevent a Data Mutiny
- 40. Know the Value per Byte of Your Data
- 41. Know Your Latencies
- 42. Learn to Use a NoSQL Database, but Not like an RDBMS
- 43. Let the Robots Enforce the Rules
- 44. Listen to Your Users—but Not Too Much
- 45. Low-Cost Sensors and the Quality of Data
- 46. Maintain Your Mechanical Sympathy
- 47. Metadata ≥ Data
- 48. Metadata Services as a Core Component of the Data Platform
- 49. Mind the Gap: Your Data Lake Provides No ACID Guarantees
- 50. Modern Metadata for the Modern Data Stack
- 51. Most Data Problems Are Not Big Data Problems
- 52. Moving from Software Engineering to Data Engineering
- 53. Observability for Data Engineers
- 54. Perfect Is the Enemy of Good
- 55. Pipe Dreams
- 56. Preventing the Data Lake Abyss
- 57. Prioritizing User Experience in Messaging Systems
- 58. Privacy Is Your Problem
- 59. QA and All Its Sexiness
- 60. Seven Things Data Engineers Need to Watch Out for in ML Projects
- 61. Six Dimensions for Picking an Analytical Data Warehouse
- 62. Small Files in a Big Data World
- 63. Streaming Is Different from Batch
- 64. Tardy Data
- 65. Tech Should Take a Back Seat for Data Project Success
- 66. Ten Must-Ask Questions for Data-Engineering Projects - Haidar Hadi
  - Question 1: What Are the Touch Points?
  - Question 2: What Are the Granularities?
  - Question 3: What Are the Input and Output Schemas?
  - Question 4: What Is the Algorithm?
  - Question 5: Do You Need Backfill Data?
  - Question 6: When Is the Project Due Date?
  - Question 7: Why Was That Due Date Set?
  - Question 8: Which Hosting Environment?
  - Question 9: What Is the SLA?
  - Question 10: Who Will Be Taking Over This Project?
- 67. The Data Pipeline Is Not About Speed
- 68. The Dos and Don’ts of Data Engineering
- 69. The End of ETL as We Know It
- 70. The Haiku Approach to Writing Software
- 71. The Hidden Cost of Data Input/Output
- 72. The Holy War Between Proprietary and Open Source Is a Lie
- 73. The Implications of the CAP Theorem
- 74. The Importance of Data Lineage
- 75. The Many Meanings of Missingness
- 76. The Six Words That Will Destroy Your Career
- 77. The Three Invaluable Benefits of Open Source for Testing Data Quality
- 78. The Three Rs of Data Engineering
- 79. The Two Types of Data Engineering and Data Engineers
- 80. The Yin and Yang of Big Data Scalability
- 81. Threading and Concurrency in Data Processing
- 82. Three Important Distributed Programming Concepts
- 83. Time (Semantics) Won’t Wait
- 84. Tools Don’t Matter, Patterns and Practices Do
- 85. Total Opportunity Cost of Ownership
- 86. Understanding the Ways Different Data Domains Solve Problems
- 87. What Is a Data Engineer? Clue: We’re Data Science Enablers
- 88. What Is a Data Mesh, and How Not to Mesh It Up
- 89. What Is Big Data?
- 90. What to Do When You Don’t Get Any Credit
- 91. When Our Data Science Team Didn’t Produce Value
- 92. When to Avoid the Naive Approach
- 93. When to Be Cautious About Sharing Data
- 94. When to Talk and When to Listen
- 95. Why Data Science Teams Need Generalists, Not Specialists
- 96. With Great Data Comes Great Responsibility
- 97. Your Data Tests Failed! Now What?
- Contributors
- Index
Product information
- Title: 97 Things Every Data Engineer Should Know
- Author(s): Tobias Macey (editor)
- Release date: June 2021
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781492062417