De-Identification by Design

De-identification is becoming a necessity for organizations that use and share large personal data holdings. The goal of de-identification is to ensure that personal data used for secondary purposes cannot identify individuals. This means much more than just deleting names and government identification numbers. As de-identification becomes a common requirement, labour-intensive and error-prone manual de-identification no longer makes sense. We help organizations to move toward “de-identification by design”: integrating de-identification capabilities into the design of information systems.

Organizations of all kinds are creating and sharing data at an exponentially growing pace. Much of this data pertains to individuals and their interactions with public institutions, businesses, non-profit organizations, and many others. Besides its original use in providing services to individuals, this data may be used for numerous secondary purposes, including research, program evaluation, reporting, and consumer analytics. Privacy regulations stipulate that personal data used for such secondary purposes must be de-identified, that is, altered so that it cannot identify individuals. De-identification is becoming a routine part of operations for many organizations that make use of large personal data holdings.

The challenge of de-identification is to ensure that privacy remains protected, regardless of where personal data travels or how it is used. While many organizations still carry out de-identification manually, a growing suite of tools, methods, and guidelines is available to organizations seeking to de-identify data more efficiently and reliably.

Approaches to De-identification

Many organizations initially attempt to de-identify data manually. Typically, technical staff is assigned to delete or randomize database fields such as names, government identification numbers, and birthdates that pose a high risk of re-identifying individuals.

The difficulty with this approach is that individuals can potentially be re-identified by any unique combination of properties. For example, the combination of an ethnic identifier, gender, number of children, and postal code could point to one or two individuals who could be identified by someone with access to additional information sources. Organizations can reduce privacy risk by sharing only the data fields relevant to a data recipient’s specific needs. However, this is not usually what is done; entire databases are often released with only the most obviously identifying fields removed. Even at its best, manual de-identification is a time-consuming process that depends on subjective judgments about which data is safe to disclose.
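
The point about unique combinations can be illustrated with a minimal sketch. The records, field names, and values below are invented for illustration; the code simply counts how many records share each combination of the chosen fields and reports combinations that occur exactly once.

```python
from collections import Counter

# Hypothetical records: names and ID numbers are already removed,
# yet one combination of the remaining fields is still unique.
records = [
    {"ethnicity": "A", "gender": "F", "children": 2, "postal_code": "K1A"},
    {"ethnicity": "A", "gender": "F", "children": 2, "postal_code": "K1A"},
    {"ethnicity": "B", "gender": "M", "children": 4, "postal_code": "K1A"},
    {"ethnicity": "A", "gender": "M", "children": 0, "postal_code": "M5V"},
    {"ethnicity": "A", "gender": "M", "children": 0, "postal_code": "M5V"},
    {"ethnicity": "A", "gender": "M", "children": 0, "postal_code": "M5V"},
]


def unique_combinations(rows, fields):
    """Return the combinations of `fields` that appear in exactly one row."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]


unique = unique_combinations(
    records, ("ethnicity", "gender", "children", "postal_code")
)
print(unique)
```

In this invented sample, no single field singles anyone out when gender alone is examined, but the four fields taken together isolate one record.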

A more sophisticated approach is to purchase a tool to evaluate the risk of re-identification. Several tools are available that calculate the probability that individuals could be identified on the basis of a given dataset. Risk measurement tools can support a more effective and defensible approach to de-identification.
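
As a rough illustration of the kind of statistic such tools report, the sketch below uses one common simplification: if an attacker knows a person is in the dataset, the chance of picking the right record is taken to be 1 divided by the number of records sharing that person's combination of fields. This is an assumed model chosen for illustration, not the method of any particular product, and the function and field names are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def reidentification_risk(rows, fields):
    """Estimate per-record risk as 1 / (size of the group sharing `fields`)."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    group_sizes = Counter(keys)
    per_record = [Fraction(1, group_sizes[k]) for k in keys]
    return {
        "max_risk": max(per_record),                       # worst-case record
        "average_risk": sum(per_record) / len(per_record)  # mean over all records
    }


# Invented sample: two groups of two records and one record that stands alone.
sample = [
    {"age_band": "30-39", "postal_code": "K1A"},
    {"age_band": "30-39", "postal_code": "K1A"},
    {"age_band": "30-39", "postal_code": "M5V"},
    {"age_band": "40-49", "postal_code": "M5V"},
    {"age_band": "40-49", "postal_code": "M5V"},
]

risk = reidentification_risk(sample, ("age_band", "postal_code"))
print(float(risk["max_risk"]), float(risk["average_risk"]))
```

Even a simple measure like this leaves open the question of what level of risk is acceptable, which is a policy decision rather than a calculation.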

The challenge of using these tools is to understand and make decisions on the basis of the risk statistics they provide. Unfortunately, many organizations do not have staff with the expertise to determine which data fields should be de-identified and which de-identification techniques are appropriate. Management often lacks the background knowledge necessary to interpret risk data and make decisions about appropriate risk levels.

De-identification by Design

A more integrative approach is to design or redesign a database to support de-identification. The ideal scenario for disclosing data to a third party is to create a dataset that includes only the specific data fields relevant to their purpose. However, the reality is that organizations typically attempt to de-identify an existing database for disclosure. This can be very risky if the database is not structured appropriately or if adequate risk measurement techniques are not used.

There are three key steps to designing or redesigning a database for de-identification:

  1. Structural design. Many databases have structural vulnerabilities that can undermine de-identification. A common problem is the existence of hidden tables or notes that can be overlooked during the de-identification process. The first step toward setting up a database suitable for de-identification is to determine its structural requirements and identify data risks. For example, a key structural requirement is that the database be fully normalized: data should be indexed by random identification codes rather than direct identifiers such as names or government identification numbers, and only one index table should link identification codes to direct identifiers. Identifying data risks means identifying unusual properties within the database that create a high risk of identification: for example, medical diagnoses that are rare within a particular age bracket, or geographic identifiers pertaining to a small population. These data risks become the initial targets for de-identification.
  2. Functional design. With appropriate expertise, software functions capable of de-identifying data without reducing its utility can be integrated into a database. Such functions might include shifting the dates of client transactions within a realistic time frame, or replacing individuals’ names with randomly generated fictional names. The capacity to replace personal data with altered or randomly generated data is especially important when information technology staff needs to perform testing and checks, which generally require access to a version of the database that has all of the same fields and functions as the live database.
  3. Risk measurement design. A key element of measuring privacy risk in the context of de-identification is the capacity to locate low-count records: any combination of properties that is unique to a particular individual or shared by only a very small group of individuals. A database designed for de-identification will include programming code that can identify low-count records that should be deleted in order to reduce privacy risk. Eliminating low-count records helps keep the risk of individuals being re-identified very low.
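
The three steps above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not a production design: the function names, field names, sample clients, the 14-day shift window, and the threshold of 2 are all hypothetical, and the sketch assumes one row per individual.

```python
import secrets
from collections import Counter
from datetime import date, timedelta


def pseudonymize(rows, direct_identifiers):
    """Step 1: move direct identifiers into one index table keyed by random codes."""
    index_table, data = {}, []
    for row in rows:  # assumes one row per individual
        code = secrets.token_hex(8)
        while code in index_table:  # regenerate on the (very unlikely) collision
            code = secrets.token_hex(8)
        index_table[code] = {f: row[f] for f in direct_identifiers}
        rest = {f: v for f, v in row.items() if f not in direct_identifiers}
        data.append({"code": code, **rest})
    return index_table, data


def shift_dates(rows, date_field, max_days):
    """Step 2: shift each date by a random offset of at most max_days either way."""
    rng = secrets.SystemRandom()
    return [
        {**row, date_field: row[date_field]
         + timedelta(days=rng.randint(-max_days, max_days))}
        for row in rows
    ]


def suppress_low_counts(rows, fields, threshold):
    """Step 3: drop rows whose combination of `fields` occurs fewer than `threshold` times."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    kept = [row for row in rows
            if counts[tuple(row[f] for f in fields)] >= threshold]
    return kept, len(rows) - len(kept)


# Invented sample data for illustration only.
clients = [
    {"name": "Client 1", "gov_id": "ID-001", "age_band": "30-39", "region": "North", "visit": date(2020, 3, 2)},
    {"name": "Client 2", "gov_id": "ID-002", "age_band": "30-39", "region": "North", "visit": date(2020, 3, 9)},
    {"name": "Client 3", "gov_id": "ID-003", "age_band": "40-49", "region": "South", "visit": date(2020, 4, 1)},
    {"name": "Client 4", "gov_id": "ID-004", "age_band": "40-49", "region": "South", "visit": date(2020, 4, 7)},
    {"name": "Client 5", "gov_id": "ID-005", "age_band": "70-79", "region": "North", "visit": date(2020, 5, 5)},
]

index_table, data = pseudonymize(clients, ("name", "gov_id"))
data = shift_dates(data, "visit", max_days=14)
released, dropped = suppress_low_counts(data, ("age_band", "region"), threshold=2)
print(len(released), "rows released;", dropped, "suppressed")
```

In this sketch the index table is the only place where codes and direct identifiers meet, so it would be stored separately under restricted access. The appropriate shift window and low-count threshold depend on the data and the recipient, and would be set as a matter of policy.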

In practice, these steps may not be precisely linear, but they summarize the most important aspects of a de-identification by design strategy. Once a database has appropriate structures, functions, and risk measurement tools for de-identification, preparing the data for release to a particular recipient is a relatively straightforward, efficient, and repeatable process. By taking the guesswork out of de-identification, a de-identification by design approach can make data sharing a safe and regular part of operations rather than an exceptional and risky event.