Is big data personal data?
Privacy laws are written to protect “personal information” held by organizations – that is, any information that could potentially identify specific individuals. Yet as larger and larger volumes of data are collected and aggregated through “big data” initiatives, the definition of personal information is getting fuzzy. Many corporations are creating “data lakes”: massive repositories of relatively unstructured data collected from one or more sources. These often contain a mix of metadata (user activity data, such as web search terms, IP addresses, or GPS data) and personal content (such as personal messages and social media posts) which, in combination, can frequently identify individuals. For example, publicly available, searchable databases of Twitter activity map tweets by location – locations so specific that they can show which house a supposedly anonymous person was in when they posted a message. In the commercial realm, some retailers and entertainment venues have begun tracking customers’ movements through their buildings using smartphone MAC addresses, which can themselves be considered personal information.
As these examples show, data lakes can contain extremely sensitive personal information. This data is not intended to be viewed by anyone – it is usually processed by computer algorithms – but too often, any employee involved in data analysis can access personal information about specific individuals. Privacy principles have traditionally emphasized access control: providing users with access only to the information that they need to do their work. But does access control have any meaning in unstructured big data environments?
Eroding informed consent
Access control is based on the privacy principle of limiting the use and disclosure of individuals’ personal information to those purposes for which it was collected. In other words, it is about informed consent: individuals have a right to know what their personal information will be used for before they share it with an organization. The reality is that companies that use big data often have legally dubious consent practices. Lengthy, impenetrable terms of use violate the spirit of informed consent. Many users of social media, in particular, do not know who has access to their personal information.
Is de-identification compatible with data lakes?
Data lakes are not set up to protect identity. They usually contain a mix of different types of personal data that, in combination, can often identify individuals. Privacy regulations demand that personal information be protected by access controls, yet data lakes are not structured for access control: usually, anyone with access can see all of the data. Though the data may be intended for aggregate-level analysis rather than for viewing individual records, an employee intent on selling data to identity thieves or stalking an ex-spouse could do a great deal of damage.
Rather than access control, the concept of “use control” may be better suited to big data contexts: focusing not so much on which data can be accessed as on the form in which it can be accessed. De-identification and anonymization methods offer ways of converting personal information holdings into aggregated, non-identifiable data that can be analyzed without revealing individual-level records.
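To make the idea concrete, one common de-identification technique coarsens quasi-identifiers (such as ZIP code and age) and suppresses any group too small to hide in – the core of k-anonymity. A minimal sketch in Python; the field names, bucket sizes, and threshold k are illustrative, not drawn from any particular product:

```python
from collections import Counter

def generalize(record):
    """Coarsen quasi-identifiers: truncate the ZIP code, bucket age into decades."""
    return (record["zip"][:3] + "**", f"{(record['age'] // 10) * 10}s")

def k_anonymize(records, k=2):
    """Release only generalized rows whose quasi-identifier group has >= k members;
    smaller groups are suppressed because they could single out an individual."""
    generalized = [generalize(r) for r in records]
    groups = Counter(generalized)
    return [g for g in generalized if groups[g] >= k]

records = [
    {"zip": "90210", "age": 34},
    {"zip": "90213", "age": 37},
    {"zip": "10001", "age": 52},  # unique after generalization -> suppressed
]
print(k_anonymize(records))  # [('902**', '30s'), ('902**', '30s')]
```

The trade-off is visible even at this scale: the released rows are useful for aggregate analysis, but the lone record that would have identified someone never leaves the repository.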
Several anonymized-analytics tools allow datasets to be queried without granting access to individual-level data. Because this approach avoids the need to de-identify a dataset in advance for all possible queries, the data distortion introduced by conventional de-identification techniques is substantially reduced, yielding higher-utility data for analytics. Anonymized analytics also support the dynamic release of live data, making the approach well suited to open data and big data environments. More than any other de-identification approach currently available, anonymized analytics have the potential to facilitate the greatly expanded data use we are seeing, with minimal privacy risk.
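One way such a tool can work is to expose only aggregate queries and perturb each answer with calibrated noise, in the style of the Laplace mechanism from differential privacy. A minimal sketch, assuming nothing about any specific vendor's API – the class name, dataset, and epsilon values are all illustrative:

```python
import math
import random

class AnonymizedQueryInterface:
    """Illustrative sketch: answer aggregate queries over a dataset without
    exposing individual rows, adding Laplace-style noise to each answer."""

    def __init__(self, rows, epsilon=1.0):
        self._rows = rows        # held privately; never returned to callers
        self._epsilon = epsilon  # smaller epsilon = more noise = more privacy

    def _laplace_noise(self, sensitivity=1.0):
        # Inverse-CDF sample from Laplace(0, sensitivity / epsilon).
        u = random.random() - 0.5
        scale = sensitivity / self._epsilon
        return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

    def count(self, predicate):
        """Noisy count of rows matching predicate; a count has sensitivity 1
        because adding or removing one person changes it by at most 1."""
        true_count = sum(1 for row in self._rows if predicate(row))
        return true_count + self._laplace_noise(sensitivity=1.0)

# Analysts can ask "how many customers are over 40?" without ever seeing a row.
rows = [{"age": a} for a in (23, 37, 41, 58, 62)]
interface = AnonymizedQueryInterface(rows, epsilon=1.0)
print(interface.count(lambda r: r["age"] > 40))  # roughly 3, plus noise
```

The design choice here mirrors the “use control” idea above: the raw rows stay behind the interface, and only perturbed aggregates cross the boundary, so the same dataset can serve many queries without a one-off de-identified extract.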