Chapter 7. Data Reduction: Research Registry Revisited

We often see massive data sets that need to be de-identified, and with data sets this large we need ways to speed up the computations. We noted earlier that the de-identification process is often iterative, so expediting these iterations helps us reach a final data set quickly. In enterprise settings, de-identification may also have to complete within a fixed time window so that subsequent analytics can run, which again calls for rapid execution.

For large and complex data sets, traditional de-identification may also result in considerable information loss. We therefore need ways to reduce the size and complexity of the data that minimize information loss while still providing the same privacy assurances.

The Subsampling Limbo

A straightforward data reduction approach is subsampling, in which we randomly sample patients from the data set to create a subset of records—the subset is the data set that is then released. The sampling fraction is the proportion of patients that are included in the subset. In the case of a cross-sectional data set, we sample based on records; in the case of a longitudinal data set, we sample based on patients instead. Otherwise, we could end up with a lot of records from those patients with an extreme number of records, even though they’re in the minority. That wouldn’t result in the correct sampling fraction, since we wouldn’t have a representative subsample ...
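To make the distinction concrete, here's a minimal sketch of patient-level subsampling in Python. It assumes a pandas DataFrame with a patient_id column; the library choice, the function name, and the column names are our illustrative assumptions, not part of the registry data set.

import pandas as pd

def subsample_patients(df, sampling_fraction, patient_col="patient_id", seed=None):
    """Sample a fraction of patients (not records) from a longitudinal data set."""
    # Draw a random fraction of the unique patient IDs...
    patients = df[patient_col].drop_duplicates()
    sampled = patients.sample(frac=sampling_fraction, random_state=seed)
    # ...then keep every record belonging to the sampled patients, so each
    # patient is either entirely in or entirely out of the subset.
    return df[df[patient_col].isin(sampled)]

# A toy longitudinal data set: patient 1 has many visits, the others few.
visits = pd.DataFrame({
    "patient_id": [1, 1, 1, 1, 1, 1, 2, 3, 4],
    "diagnosis":  ["A", "B", "A", "C", "B", "A", "A", "C", "B"],
})

# A 50% sampling fraction means half the patients, not half the records.
subset = subsample_patients(visits, sampling_fraction=0.5, seed=42)

Note that the number of records in the subset varies with which patients happen to be drawn; it's the patient-level sampling fraction that is exact. Sampling rows directly, as in df.sample(frac=0.5), would be appropriate only for a cross-sectional data set, where each row is one patient.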
