Chapter 2. Anonymization

In this chapter, you’ll dive into anonymization: what it is, how to use it, and what factors you need to consider when using it for data science. You may already know about anonymization, and this chapter might challenge or contradict what you think you know! Anonymization has plagued researchers and scientists for decades, and it still incites debate among privacy professionals today. In this chapter, you’ll learn rigorous, scientific definitions of anonymization, which means you will learn about differential privacy. This will help you approach the problem with state-of-the-art technologies and give you tools that let you provide strong privacy protections while still performing accurate data science.

What Is Anonymization?

To anonymize is “to remove identifying information from (something, such as computer data) so that the original source cannot be known.” 1 When you anonymize personal data, you want to make sure the data cannot be traced back to a particular individual. But how exactly can you do that?

In the past, there have been several approaches, most of which have been debunked and replaced with newer understandings. Some of the initial approaches (categorized under Statistical Disclosure Control) used a variety of methods that are no longer recommended, such as suppression, aggregation, and transformations, to anonymize the data. These methods were like early cryptographic ciphers. They obfuscated the original source, but a ...
