De-identification in a Renewed PIPEDA

Canada’s main private-sector privacy law, the Personal Information Protection and Electronic Documents Act (PIPEDA), is currently under review by the Office of the Privacy Commissioner. A discussion paper published by the Commissioner suggests a variety of potential solutions to new privacy issues catalyzed by the accelerating collection, retention, use, and disclosure of personal data often described as ‘big data.’ Along with improved consent practices and new ethical assessments, de-identification is a proposed solution. We agree that risk-based de-identification can be used effectively to protect privacy in big data contexts.

A variety of sophisticated new data liberation technologies give users access to data while masking or erasing the identity of the data source, using de-identification techniques such as tokenization or anonymization. Combined with automated risk analysis tools, de-identification permits the ongoing use of data while protecting individual privacy.
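
To make tokenization concrete, the following is a minimal Python sketch: direct identifiers are replaced with random surrogate tokens, and the token-to-value mapping is held in a separate “vault” under stricter controls. The field names and the vault structure are illustrative assumptions, not drawn from any particular product or standard.

    import secrets

    def tokenize(record, fields, vault):
        """Return a copy of `record` with each named field replaced by a random token."""
        out = dict(record)
        for field in fields:
            if field in out:
                token = secrets.token_hex(8)  # unguessable surrogate value
                vault[token] = out[field]     # re-identification key, kept under separate custody
                out[field] = token
        return out

    vault = {}
    patient = {"name": "Jane Doe", "health_card": "1234-567-890", "diagnosis": "asthma"}
    print(tokenize(patient, ["name", "health_card"], vault))
    # e.g. {'name': '9f1c2b...', 'health_card': '4ad0e7...', 'diagnosis': 'asthma'}

Because the vault alone can reverse the mapping, tokenization protects privacy only so long as the vault is governed and secured separately from the de-identified data.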

De-identification does have limits at present. Most current technologies focus on the protection of text records. Given the proliferation of recording technologies (such as smartphone cameras, Google Glass, or drones), future privacy-bolstering technologies will need to adapt to other kinds of content, and to an individual’s rights in that content. For example:

  • Video privacy: Does an individual consent to be photographed or filmed? If not, privacy-bolstering technology could allow the image to be masked or erased (a minimal sketch follows this list).
  • Audio privacy: Does an individual consent to be recorded? If not, privacy-bolstering technology could allow the relevant part of the recording to be masked or erased.
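
As a rough illustration of the video case, the sketch below blacks out a region of a frame, assuming some upstream detector has already located the non-consenting subject; the detector itself and the bounding-box coordinates are hypothetical.

    import numpy as np

    def mask_region(frame, box):
        """Black out the (top, left, bottom, right) pixels of a non-consenting subject."""
        top, left, bottom, right = box
        masked = frame.copy()
        masked[top:bottom, left:right] = 0  # erase; blurring or pixelation are alternatives
        return masked

    frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)  # stand-in for a real frame
    print(mask_region(frame, (100, 200, 180, 260)).shape)  # (480, 640, 3)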

Beyond technological limitations, however, there are certain ethical discussions about the use of de-identified data that need to take place. Effective de-identification may completely conceal individual identities, but the ways in which data is used have broad impacts on society. Conclusions drawn on the basis of big data analytics affect marketing decisions, media coverage, and corporate and government policy. Apart from the question of individual privacy, we suggest that big data should be seen as a common good. Just as corporations need a “social license” to exploit publicly owned natural resources, they should also be required to engage in meaningful public consultation about the uses of big data.

A Threat and Risk Assessment Approach for Big Data

Current information security and privacy classifications are being applied with some difficulty to a Big Data environment, which in the context of the public sector involves the emergence of large databases and increased data sharing. Such databases are usually classified as high-risk, resulting in costly security safeguards. However, de-identification can drastically lower the actual privacy risk posed by information. Could mapping de-identification to risk classifications allow organizations to invest more wisely in security and take advantage of the opportunities of Big Data?

Threat and Risk Assessments (TRAs) are commonly required for new Canadian government programs and other public sector initiatives in order to determine whether their information assets are being protected appropriately. Their focus is on security: examining the potential for harm if information is accessed, released, or used inappropriately; analyzing potential risks to that information; and identifying appropriate safeguards across the information lifecycle.

In 2005, the federal government released the Canadian Information Security and Privacy Classification Policy as a guideline for risk assessments. This system defines four risk levels, based on criteria such as potential threats to public safety, injury to individuals or enterprises, financial loss, and damage to government relationships and reputation. Appropriate safeguards are identified for each risk level. The Ontario Ministry of Government Services has since adopted these classifications as a guide for TRAs within the Ontario Public Service.

Applying these classifications to a broad variety of public sector contexts has led to two significant problems, both related to the phenomenon of Big Data. The federal classification guidelines were clearly designed with a political context in mind: examples given for the various risk levels include cabinet documents, briefings, speeches, and contact information. At the provincial level, these classifications do not translate easily to contexts such as healthcare, where information is collected in large volumes and regularly shared between organizations. The first problem is that the large volume of information contained in healthcare databases creates great potential for harm in the event of a breach; consequently, such databases are usually classified as high-risk. The safeguards mandated to protect high-risk information are costly, and with the emergence of Big Data, these costs are likely to grow rapidly. The second problem pertains to information sharing: not only is there a possibility that information classified as high-risk is being shared with parties that have inadequate security safeguards, but the sharing of personal information raises a number of more basic privacy issues.

To resolve these issues, government needs to stop conflating privacy with security. On the one hand, it is possible for information to be protected by adequate security safeguards and yet for its handling to violate privacy law. A significant issue in the healthcare sector has been that of cascading rights when organizations share personal health information for research purposes. While all of the organizations involved may have effective security practices, the information is often disclosed and used for purposes to which patients did not consent. Because shared information is stored in multiple locations, it is also often retained longer than privacy standards permit. On the other hand, it is possible to protect privacy without relying primarily on security safeguards. Sophisticated and efficient de-identification processes can remove identifying details from records containing personal information while preserving the utility of the data for research. Properly de-identified information can be shared with only a minimal risk to privacy.
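
By way of illustration, a minimal de-identification step might drop direct identifiers and generalize quasi-identifiers, as in the Python sketch below. The field names and generalization rules (birth date to birth year, postal code to its three-character forward sortation area) are assumptions chosen for the example; real processes are considerably more sophisticated.

    DIRECT_IDENTIFIERS = {"name", "health_card"}

    def deidentify(record):
        """Drop direct identifiers and coarsen quasi-identifiers."""
        out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
        if "date_of_birth" in out:
            out["birth_year"] = out.pop("date_of_birth")[:4]   # keep year only
        if "postal_code" in out:
            out["postal_prefix"] = out.pop("postal_code")[:3]  # forward sortation area
        return out

    record = {"name": "Jane Doe", "health_card": "1234-567-890",
              "date_of_birth": "1980-06-15", "postal_code": "M5V 2T6",
              "diagnosis": "asthma"}
    print(deidentify(record))
    # {'diagnosis': 'asthma', 'birth_year': '1980', 'postal_prefix': 'M5V'}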

The distinction between privacy and security has two implications. First, process matters when it comes to protecting data: excellent security safeguards will not ensure proper information management if privacy concerns are not integrated into business processes and practices. Second, de-identification can radically change information risk. Calculations of re-identification risk – the probability that an individual could be identified from their (de-identified) data – provide an objective measure of privacy risk. When privacy risk is very low, fewer security safeguards are needed. Thus, mapping levels of de-identification to information risk classifications could enable much more efficient and effective investment in information safeguards. An approach that unites privacy and security in risk classification could well be the means to unlock the opportunities offered by Big Data while containing the costs of information security.
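
To show how such a calculation might work, the sketch below uses one common model (k-anonymity): each record’s re-identification risk is 1/k, where k is the number of records sharing its quasi-identifier values. The risk thresholds used for the classification are illustrative assumptions, not figures from the federal policy.

    from collections import Counter

    def reidentification_risk(records, quasi_identifiers):
        """Maximum and average per-record risk under a k-anonymity model."""
        classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
        max_risk = 1 / min(classes.values())    # worst case: the smallest equivalence class
        avg_risk = len(classes) / len(records)  # average of 1/k over all records
        return max_risk, avg_risk

    def classify(max_risk):
        if max_risk <= 0.05:   # every record hides among 20 or more
            return "low"
        if max_risk <= 0.2:    # every record hides among 5 or more
            return "medium"
        return "high"

    records = [{"birth_year": "1980", "postal_prefix": "M5V"},
               {"birth_year": "1980", "postal_prefix": "M5V"},
               {"birth_year": "1975", "postal_prefix": "K1A"}]
    max_r, avg_r = reidentification_risk(records, ["birth_year", "postal_prefix"])
    print(max_r, classify(max_r))  # 1.0 high -- the 1975/K1A record is unique

The resulting risk tier could then be mapped to a risk classification level to determine the safeguards required, which is the kind of privacy-security integration proposed above.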