Preface
A few years ago we partnered with O’Reilly to write a book of case studies and methods for anonymizing health data, walking readers through practical approaches to producing anonymized data sets in a variety of contexts.1 Since that time, interest in anonymization, sometimes also called de-identification, has increased due to the growth in the collection and use of data, evolving and stricter privacy laws, and expectations of trust by privacy regulators, by private industry, and by citizens from whom data is being collected and processed.
Why We Wrote This Book
The sharing of data for the purposes of data analysis and research can have many benefits. At the same time, concerns and controversies about data ownership and data privacy elicit significant debate. O’Reilly’s “Data Newsletter” on January 2, 2019, recognized that tools for secure and privacy-preserving analytics are a trend on the O’Reilly radar. Thus an idea was born: write a book on strategies for leveraging the spectrum of identifiability to disassociate personal information from data in a variety of contexts, enhancing privacy while still providing useful data. The result is this book, in which we explore end-to-end solutions to reduce the identifiability of data. We draw on various data collection models and use cases that are driven by real business needs, have been learned from working in some of the most demanding data environments, and are based on practical approaches that have stood the test of time.
The central question we are consistently asked is how to utilize data in a way that protects individual privacy, but still ensures the data is of sufficient granularity that analytics will be useful and meaningful. By incorporating anonymization methods to reduce identifiability, organizations can establish and integrate secure, repeatable anonymization processes into their data flows and analytics in a sustainable manner. We will describe different technologies that reduce identifiability by generalizing, suppressing, or randomizing data, to produce outputs of data or statistics. We will also describe how these technologies fit within the broader theme of “risk-based” methods to drive the degree of data transformations needed based on the context of data sharing.
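To make those transformations a little more concrete, here is a minimal sketch (not from the book) of what generalizing, suppressing, and randomizing a single record can look like; the field names, values, and noise bounds are illustrative assumptions.

    # A minimal sketch of three common transformations that reduce identifiability:
    # generalizing a date of birth to a year of birth, suppressing a value that is
    # too identifying on its own, and randomizing (perturbing) a numeric field.
    import random

    record = {"date_of_birth": "1984-07-19", "zip": "90210", "income": 72000}

    # Generalization: keep only the year of birth.
    record["year_of_birth"] = record.pop("date_of_birth")[:4]

    # Suppression: remove the value entirely.
    record["zip"] = None

    # Randomization: add bounded noise so aggregate statistics remain useful.
    record["income"] += random.randint(-2000, 2000)

    print(record)

In practice such transformations are applied across an entire data set, with the amount of generalization, suppression, or noise driven by the risk-based methods discussed throughout the book.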
Note
The purpose of a risk-based approach is to replace an otherwise subjective gut check with a more guided decision-making approach that is scalable and proportionate, resulting in solutions that ensure data is useful while being sufficiently protected. Statistical estimators are used to provide objective support, with greater emphasis placed on empirical evidence to drive decision making.
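As a simple illustration of such an estimator (a sketch, not the book’s method), records can be grouped by their identifying features and one over the group size treated as the re-identification risk for each record; the sample data, the choice of identifying features, and the threshold below are illustrative assumptions.

    # Group records by their quasi-identifiers (year of birth, sex, region) and
    # treat 1/(group size) as the re-identification risk for each record.
    from collections import Counter

    records = [
        ("1984", "F", "K1A"), ("1984", "F", "K1A"), ("1984", "F", "K1A"),
        ("1972", "M", "K2B"), ("1972", "M", "K2B"), ("1990", "F", "K1A"),
    ]

    group_sizes = Counter(records)
    risks = [1 / group_sizes[r] for r in records]

    max_risk = max(risks)
    avg_risk = sum(risks) / len(risks)
    print(f"max risk = {max_risk:.2f}, average risk = {avg_risk:.2f}")

    # Decision support: transform the data further only if the estimated risk
    # exceeds a threshold chosen for the context of data sharing.
    print("more de-identification needed" if max_risk > 0.09 else "acceptable")

Estimates like these replace the gut check with numbers that can be compared against a threshold appropriate to the context in which data is being shared.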
We have a combined three decades of experience in data privacy, from academic research and authorship to training courses, seminars, and presentations, as well as leading highly skilled teams of researchers, data scientists, and practitioners. We’ve learned a great deal, and we continue to learn a great deal, about how to put privacy technology into practice. We want to share that knowledge to help drive best practice forward, demonstrating that it is possible to achieve the “win-win” of data privacy that has been championed by the likes of former privacy commissioner Dr. Ann Cavoukian in her highly influential concept of Privacy by Design.2 There are many privacy advocates who believe that we can and should treat privacy as a societal good that is encouraged and enforced, and that there are practical ways we can achieve this while meeting the wants and needs of our modern society.
This is, however, a book of strategy, not a book of theory. Consider this book your advisor on how to plan for and use the full spectrum of anonymization tools and processes. The book will guide you in using data for purposes other than those originally intended, helping to ensure not only that the data is richer but also that its use is legal and defensible. We will work through different scenarios based on three distinct classes of identifiability of the data involved, and provide details to help you understand some of the strategic considerations that organizations are struggling with.
Warning
Our aim is to help match privacy considerations to technical solutions. This book is generic, however, touching on a variety of topics relevant to anonymization. Legal interpretations are contextual, and we urge you to consult with your legal and privacy team! Materials presented in this book are for informational purposes only, and not for the purpose of providing legal advice. Okay, now that we’ve given our disclaimer, we can breathe easy.
Who This Book Was Written For
When conceptualizing this book, we divided the audience into two groups: those who need strategic support (our primary audience) and those who need to understand strategic decisions (our secondary audience). Whether in government or industry, delivering on the promise of data is a functional need. We assume that our audience is ready to do great things, beyond compliance with data privacy and data protection laws. And we assume that they are looking for data access models to enable the safe and responsible use of data.
Primary audience (concerned with crafting a vision and ensuring the successful execution of that vision):
- Executive teams concerned with how to make the most of data, e.g., to improve efficiencies, derive new insights, and bring new products to market, all in an effort to make their services broader and better while enhancing the privacy of data subjects. They are more likely to skim this book to nail down their vision and how anonymization fits within it.
- Data architects and data engineers who need to match their problems to privacy solutions, thereby enabling secure and privacy-preserving analytics. They are more likely to home in on specific details and considerations to help support strategic decisions and figure out the specifics they need for their use cases.
Secondary audience (concerned with understanding the vision and how it will be executed):
- Data analysts and data scientists who want to understand decisions made regarding the access they have to data. As a detail-oriented group, they may have more questions than we can cover in one book! From our experience this may lead to interest in understanding privacy more broadly (certainly a good thing).
- Privacy professionals who wish to support the analytic function of an organization. They live and breathe privacy, and unless they have a technical background, they may actually want to dig into specific sections and considerations. That way they can figure out how they can support use cases with their strong knowledge and understanding of privacy.
A core challenge with writing a book of strategy about the safe and responsible use of data is striking the right balance in terms of language and scope. This book will cover privacy, data science, and data processing. Although we attempt to introduce the reader to some basic concepts in all of these areas, we recognize that it may be challenging for some readers. We hope that the book will serve as an important reference, and encourage readers to learn more where they feel it is needed.
How This Book Is Organized
We’ll provide a conceptual basis for understanding anonymization, starting with an understanding of identifiability, that is, a reasonable estimate of how records cluster on identifying features in the data and the likelihood of an attack. We will do this in two chapters, starting with the idea of an identifiability spectrum to understand identifiability in data in Chapter 2, and then a governance framework that explains the context of data sharing to understand threats in Chapter 3. Identifiability will be assessed in terms of both data and context, since they are intimately linked. Our identifiability spectrum will therefore evolve from the concept of data identifiability into one that encompasses both data and context.
From this conceptual basis of identifiability, we will then look at data processing steps to create different pipelines. We’ll start with identified data and concepts from privacy engineering in Chapter 4, that is, how to design a system with privacy in mind, building in protections and, in particular, reducing identifiability for those novel uses of data that fall outside of the original purposes of data collection. We will also touch on the subject of having both identified and anonymized data within the same data holdings.
Once we’ve established the requirements related to identified data, we will consider another class of data from which direct identifiers have been removed, known as pseudonymized data. This is the first step in reducing identifiability: removing the names and addresses of the people in the data. In Chapter 5, we start to explicitly work toward anonymizing data. We first look at how pseudonymization fits as data protection, and introduce a first step toward anonymization. We also consider analytics technologies that can sit on top of pseudonymized data, and what that means in terms of anonymization.
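As a minimal sketch of this first step (an illustration, not the book’s implementation), direct identifiers can be replaced with keyed tokens so that records about the same person remain linkable without revealing who they are; the secret key and field names below are assumptions, and real pipelines manage keys separately from the data.

    # Pseudonymization sketch: replace direct identifiers with stable, keyed tokens.
    import hmac, hashlib

    SECRET_KEY = b"replace-with-a-managed-secret"  # illustrative; store and rotate securely

    def pseudonym(value: str) -> str:
        """Derive a stable keyed token so the same person always gets the same token."""
        return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

    record = {"name": "Jane Doe", "email": "jane@example.com", "diagnosis": "J45"}
    record["patient_token"] = pseudonym(record.pop("email"))
    record.pop("name")  # drop the remaining direct identifier
    print(record)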
Our final data pipeline, in Chapter 6, is focused entirely on anonymization (and so entirely on secondary uses of data). We start with the more traditional approach of pushing anonymized data from the source to a recipient. But then we turn things around, considering anonymized data as being pulled by the recipient. This way of thinking provides an interesting opportunity to leverage anonymization from a different set of requirements, and opens up a way to build data lakes. We will do this by building on concepts introduced in other chapters, to come up with novel approaches to building a pipeline.
We finish the book in Chapter 7 with a discussion of the safe use of data, including the topics of accountability and ethics. The practical use of “deep learning” and related methods in artificial intelligence and machine learning (AIML) has introduced new concerns to the world of data privacy. Many frameworks and guiding principles have been suggested to manage these concerns, and we wish to summarize and provide some practical considerations in the context of building anonymization pipelines.
Conventions Used in This Book
The following typographical conventions are used in this book:
- Italic: Indicates new terms, URLs, email addresses, filenames, and file extensions.
- Constant width: Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Tip
This element signifies a tip or suggestion.
Note
This element signifies a general note.
Warning
This element indicates a warning or caution.
O’Reilly Online Learning
Note
For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.
Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit http://oreilly.com.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
- O’Reilly Media, Inc.
- 1005 Gravenstein Highway North
- Sebastopol, CA 95472
- 800-998-9938 (in the United States or Canada)
- 707-829-0515 (international or local)
- 707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/building-anonymization-pipeline.
Email bookquestions@oreilly.com to comment or ask technical questions about this book.
For news and more information about our books and courses, see our website at http://oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgments
This book would not be possible without the support of the many experts at Privacy Analytics, who work day in and day out in advisory, data and software delivery, and implementation. It’s one thing to theorize solutions; it’s quite another to work with organizations, large and small, to bring privacy practices and solutions to market and at scale. It’s through working with clients that real-world solutions are born and grow up.
We must gush about our technical reviewers! They took the time to read the entirety of the first draft of this book and provided valuable feedback. Their varied backgrounds provided critical insights. Their feedback on the manuscript allowed us to directly address areas in need of further development. While the views and opinions expressed in this book are our own, we hope that we successfully incorporated their feedback into the final version of this book. In alphabetical order, we wish to thank: Bryan Cline, an expert in standards and risk management; Jordan Collins, an expert in real-world anonymization; Leroy Ruggerio, an expert in business technology; and Malcolm Townsend, an expert in data protection technology.
We would also like to thank Felix Ritchie for having created and promoted the adoption of the Five Safes, which served as inspiration to us! An entire chapter is dedicated to the Five Safes, and we have been fortunate to work with Felix since we drafted our first version of that chapter. We appreciated the help of Pierre Chetelat with final edits, which also served as an opportunity for him to learn about the legal and technical landscape in which we work.
Finally, we must thank O’Reilly for giving us the opportunity to write another book about anonymization in practice, and Melissa Potter, our development editor at O’Reilly, who supported us in the writing and editing of this book. We may not have visibility behind the curtain at O’Reilly, but we also thank their team of diligent copy editors, graphic artists, technical support, and everyone else who brings books to market.
1 Khaled El Emam and Luk Arbuckle, Anonymizing Health Data: Case Studies and Methods to Get You Started, (Sebastopol, CA: O’Reilly, 2014), http://oreil.ly/anonymizing-health-data.
2 Ann Cavoukian, “Privacy by Design: The 7 Foundational Principles,” Information and Privacy Commissioner of Ontario (January 2011), https://oreil.ly/eSQRA.