Sunday, September 23, 2012

Data De-identification Dilemma

De-identification is the process of removing elements from a dataset so that a data row can no longer be linked to a specific individual. Its purpose is to protect the privacy of the users of systems, as required by legislation in many countries. While HIPAA in the US is the best-known act protecting personally identifiable data, many other countries have also enacted legislation that regulates the handling of such data to varying degrees.
Organizations are becoming increasingly security-aware as they feel the impact of the risks of not appropriately protecting their data and information assets. For the purpose of this discussion, we can assume that appropriate checks and controls are in place for data in the active store. However, with the evolution of the cloud and the growing integration of external systems, data is increasingly exchanged with or disclosed to interconnected systems, or stored elsewhere in the cloud to support needs such as backup or business analytics. Any dataset disclosed or stored in this way needs to be de-identified, so that the privacy interests of the individuals concerned are protected and the applicable privacy legislation is complied with.
Under HIPAA, individually identifiable health information is de-identified if the following specific fields of data are removed or generalized:
  • Names
  • Geographic subdivisions smaller than a state
  • All elements of dates (except year) related to an individual (including dates of admission, discharge, birth, death)
  • Telephone & FAX numbers
  • Email addresses
  • Social security numbers
  • Medical record numbers
  • Health plan beneficiary numbers
  • Account numbers
  • Certificate / license numbers
  • Vehicle identifiers and serial numbers including license plates
  • Device identifiers and serial numbers
  • Web URLs
  • Internet protocol addresses
  • Biometric identifiers (including finger and voice prints)
  • Full face photos and comparable images
  • Any unique identifying number, characteristic or code
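To make the list above concrete, here is a minimal sketch of how such a rule set might be applied to a single record. The field names (name, ssn, zip_code, birth_date and so on) are assumptions made up for illustration, not a real schema, and the sketch covers only the "remove or reduce to year" idea from the list. It is not a complete compliance implementation.

```python
from datetime import date

# Hypothetical field names; a real schema will differ.
DIRECT_IDENTIFIERS = {
    "name", "phone", "fax", "email", "ssn", "medical_record_number",
    "health_plan_number", "account_number", "license_number",
    "vehicle_id", "device_id", "url", "ip_address", "biometric_id", "photo",
}
SUB_STATE_GEO = {"street", "city", "county", "zip_code"}
DATE_FIELDS = {"birth_date", "admission_date", "discharge_date", "death_date"}


def deidentify(record: dict) -> dict:
    """Return a copy of the record with listed fields removed or generalized."""
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS or field in SUB_STATE_GEO:
            continue  # drop the field entirely
        if field in DATE_FIELDS:
            # keep only the year, e.g. birth_date -> birth_year
            year = value.year if value is not None else None
            out[field.replace("_date", "_year")] = year
        else:
            out[field] = value
    return out


# Made-up example record
record = {
    "name": "Jane Roe",
    "ssn": "000-00-0000",
    "zip_code": "00000",
    "state": "MA",
    "birth_date": date(1970, 5, 17),
    "diagnosis": "asthma",
}
deidentified = deidentify(record)
```

The function works on a copy, so the original record is left untouched. Free-text fields, which can hide any of the identifiers above, are outside the scope of this sketch.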
In today’s context, a vast amount of personal information is available from public and private sources all around the world, including public records such as telephone directories, property records and voter registers, and even social networking sites. This data can be linked against de-identified data to re-identify individuals, and the chances of that happening are high. Professor Sweeney testified that there is a 0.04% chance that data de-identified under the health rule’s methodology could be re-identified when compared to voter registration records for a confined population.
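The linking described above can be sketched in a few lines. This is only an illustration under stated assumptions: the quasi-identifier fields (birth_year, sex, state), the function name link and all the sample data are made up, and real linkage attacks use richer data and fuzzier matching.

```python
from collections import defaultdict

# Assumed quasi-identifiers that survive de-identification
# and also appear in a public register.
QUASI_IDENTIFIERS = ("birth_year", "sex", "state")


def link(deidentified, public_register, keys=QUASI_IDENTIFIERS):
    """Map the index of each de-identified row to a name when exactly
    one person in the public register shares its quasi-identifiers."""
    index = defaultdict(list)
    for person in public_register:
        index[tuple(person[k] for k in keys)].append(person["name"])

    matches = {}
    for i, row in enumerate(deidentified):
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the row
            matches[i] = candidates[0]
    return matches


# Made-up sample data
released = [
    {"birth_year": 1970, "sex": "F", "state": "MA", "diagnosis": "asthma"},
    {"birth_year": 1985, "sex": "M", "state": "MA", "diagnosis": "flu"},
]
register = [
    {"name": "Jane Roe", "birth_year": 1970, "sex": "F", "state": "MA"},
    {"name": "John Doe", "birth_year": 1985, "sex": "M", "state": "MA"},
    {"name": "Jim Poe", "birth_year": 1985, "sex": "M", "state": "MA"},
]
reidentified = link(released, register)
```

In this toy data the first row matches exactly one person and is re-identified, while the second matches two people and stays ambiguous. The smaller the population that shares a combination of attributes, the more rows fall into the first case.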
Others have also written about the shortcomings of de-identification. A June 2010 article by Arvind Narayanan and Vitaly Shmatikov offers a broad and general conclusion:  
The emergence of powerful re-identification algorithms demonstrates not just a flaw in a specific anonymization technique(s), but the fundamental inadequacy of the entire privacy protection paradigm based on “de-identifying” the data.
With various tools and technologies, it may at times be possible to achieve de-identification that is probably absolute. However, it seems unlikely that there is a general solution that will work for all types of data, all types of users, and all types of activities. Thus, we continue to face the possibility that de-identified personal data shared for research and other purposes may be subject to re-identification.
Regulatory requirements on the subject vary widely across legislations. While some require the removal of specific data fields, some mandate adherence to certain administrative processes, and a few others require compliance with one or more standards.
Robert Gellman, in his paper titled The Deidentification Dilemma: A Legislative and Contractual Proposal, calls for a contractual solution backed by new legislation. Whether or not it is backed by legislation, it would be wise to follow this approach, as it helps bind data recipients to the requirements of the data discloser. With the use of SaaS applications on the rise, the chances of data being stored elsewhere and travelling over the wire are very high. The increasing need for data and application integration over the cloud across partner organizations again makes such a contractual solution a must.
The core proposal in the legislation, which Gellman calls the Personal Data Deidentification Act (PDDA), is a voluntary data agreement: a contract between a data discloser and a data recipient. The PDDA will apply only to those who choose to accept its terms and penalties through a data agreement. The PDDA establishes standards for behaviour and civil and criminal penalties for violations. In exchange, there are benefits to the discloser and recipient.
With the above understanding of the requirements for data de-identification, let us list the circumstances that call for it:
  • All non-production database instances, including the development, test, training and production support instances an organization may maintain. DBAs commonly maintain and run scripts to anonymize personal data before such an instance is opened up to its intended users. It is also important to ensure that the anonymization meets the regulatory requirements of the region where the instance is hosted.
  • The increased use of business analytics calls for maintaining one or more data marts, which are often replicas of the production database. It is absolutely fine for such data marts to store data summarized at a level where no row represents a single individual, but care has to be taken if micro-level data is also kept in the mart to facilitate drill-through.
  • Application controls – All systems that work with databases containing personally identifiable information should be designed with appropriate access controls built in, so that sensitive information is protected from being displayed or extracted.
  • Remote workers & mobility challenges – Organizations have started accepting the culture of remote working and employee mobility. This means that employees access data through one or more applications from anywhere in the world, using a multitude of devices. This calls for appropriate policies, checks and controls to remain compliant with privacy legislation.
  • Partner systems – In today’s connected world, business partners, who might be customers, vendors or even contracted outsourced service providers, gain access to the systems and databases of the organization. This certainly calls for a careful evaluation of their culture and a voluntary agreement by such parties to comply with the organization’s data privacy needs. It even calls for periodic training and audits for the employees and systems of such partner organizations.
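The first two items above, anonymization scripts for non-production instances and summarized data marts, lend themselves to a small sketch. Everything here is an assumption for illustration: the function names pseudonymize and summarize, the use of a keyed hash, and the minimum group size of 5 are not taken from any regulation or from Gellman's paper. Note also that a keyed hash is still a unique code per individual, so whether it satisfies a given regulation has to be checked separately.

```python
import hashlib
import hmac
from collections import Counter


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256) so that
    rows stay joinable in a test copy without exposing the raw value."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def summarize(rows, group_keys, min_group_size=5):
    """Count rows per group and suppress groups smaller than
    min_group_size, so no summary row stands for very few individuals."""
    counts = Counter(tuple(row[k] for k in group_keys) for row in rows)
    return {group: n for group, n in counts.items() if n >= min_group_size}


# Made-up usage
key = b"keep-this-key-out-of-the-test-instance"
token = pseudonymize("patient-001", key)

rows = [{"state": "MA", "year": 2012}] * 5 + [{"state": "NY", "year": 2012}] * 2
mart = summarize(rows, ("state", "year"))
```

The same input and key always produce the same token, which keeps foreign-key relationships intact in a test copy. The key itself has to stay outside the non-production environment, otherwise the mapping can be rebuilt. In the summary, the two-row group is dropped rather than published.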
Today’s lack of clear definitions, de-identification procedures, and legal certainty can impede some useful data sharing. It can also affect privacy of users when the lack of clarity about de-identification results in sharing of identifiable data that could have been avoided. The approach proposed by Robert Gellman will make available a new tool that fairly balances the needs and interests of data disclosers, data users, and data subjects. The solution could be invoked voluntarily by data disclosers and data recipients. Its use could also be mandated by regulation or legislation seeking to allow broader use of personal data for beneficial purposes.

The Deidentification Dilemma: A Legislative and Contractual Proposal
-- Robert Gellman - Version 2.4, July 12, 2010