On July 8, 2013, at 7:34 p.m., Bradley Cooper got into a cab in Tribeca and arrived on Bank Street ten minutes later. He paid his fare of $9.00 in cash. I know this not because I followed Bradley that day, nor because I read it in the tabloids, but from an analysis by writer JK Trotter of a dataset made available by the New York City Taxi & Limousine Commission. All Trotter had to do was combine the cab data with archives of celebrity photos.
In 2014, the New York City Taxi & Limousine Commission released anonymized records of about 173 million taxi trips. The anonymization, however, was so skimpy that malicious parties could easily recover personally identifiable information about every driver in the dataset.
What’s most alarming in situations like this is the picture that emerges when the data is aggregated with other sources, as Trotter did. By combining indirectly related attributes, one can re-identify seemingly anonymized personal data.
Data analytics requires large datasets, and these usually contain a lot of sensitive data. Let’s take an example. In a sample dataset, “key attributes” are attributes that directly identify an individual, and they are usually anonymized. So what are “quasi-identifiers?” These are attributes that do not identify anyone on their own but can be used to link an anonymized dataset with other datasets. In a study on k-anonymity, Latanya Sweeney, Director of the Data Privacy Lab in the Institute for Quantitative Social Science at Harvard, found that the combination of 5-digit ZIP code, birth date, and gender uniquely identifies 87% of the U.S. population. And there’s a third category of attributes – “sensitive attributes.” These can be medical records, salaries, demographic details, or other attributes that researchers actually need, so they cannot be anonymized haphazardly.
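To see why quasi-identifiers are dangerous, here is a minimal sketch (with made-up sample rows) that counts how many records share each (ZIP, birth date, gender) combination. Any record whose combination is unique can be re-identified by linking with another dataset that carries the same fields:

```python
from collections import Counter

# Made-up sample rows: (zip_code, birth_date, gender).
records = [
    ("02138", "1976-01-21", "M"),
    ("02138", "1976-01-21", "M"),
    ("02139", "1981-07-04", "F"),
    ("10014", "1990-12-02", "F"),
]

# Count occurrences of each quasi-identifier combination.
counts = Counter(records)

# Combinations seen exactly once are linkable to a single person.
unique = [quasi_id for quasi_id, n in counts.items() if n == 1]
print(f"{len(unique)} of {len(counts)} quasi-identifier combinations are unique")
# prints "2 of 3 quasi-identifier combinations are unique"
```

In k-anonymity terms, a dataset is safe only when every combination appears at least k times; the two unique rows above would fail even k = 2.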
That brings me to the question I want to put on the table: do we need unanonymized data for analytics? The answer is no. What we do need is realistic data. The key requirement for this realistic data is that it is anonymized in a way that does not hamper its analytical value.
How do we anonymize while keeping the business value of the data intact? The first step is to classify attributes into direct identifiers and quasi-identifiers. The next step is to ensure that no combination of these attributes can lead to re-identification. This can be done by maintaining demographic logic in the anonymized data. For example, Andre, born on 1/21/76, can be anonymized to Miles, born on 1/23/76. What we did here is:
Gender remains the same
Name replaced with another five-letter name
Date of birth shifted by two days, while the month and year remain the same.
This data maintains the same demographic profile, which is ideal for analytics, without giving away any personally identifiable information. In other words, we maintained demographic logic, not the absolute demographics.
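The steps above can be sketched in a few lines. This is an illustration under my own assumptions (the replacement-name pool and the two-day shift window are invented for the example), not the article's actual algorithm:

```python
import random
from datetime import date, timedelta

# Illustrative pool of replacement names; a real system would draw
# from a much larger dictionary keyed by length and gender.
FIVE_LETTER_NAMES = ["Miles", "Jonas", "Felix", "Oscar"]

def anonymize(name: str, dob: date, gender: str) -> tuple[str, date, str]:
    # Replace the name with a different one of the same length.
    candidates = [n for n in FIVE_LETTER_NAMES
                  if len(n) == len(name) and n != name]
    new_name = random.choice(candidates) if candidates else name[::-1]

    # Shift the day of birth by one or two days (never zero),
    # flipping direction if the shift would cross a month boundary,
    # so month and year stay intact.
    shifted = dob + timedelta(days=random.choice([-2, -1, 1, 2]))
    if shifted.month != dob.month:
        shifted = dob - (shifted - dob)

    return new_name, shifted, gender  # gender remains the same

print(anonymize("Andre", date(1976, 1, 21), "M"))
```

The output is a record such as `("Miles", date(1976, 1, 23), "M")`: same gender, same month and year of birth, same name length, but no longer Andre's actual data.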
Now let’s look at a real-life example (diagram 1). A few banks have engaged a company with a proprietary analytics algorithm to analyse their data for them. Before handing the data over, they need to preserve the privacy of their customers while ensuring that the anonymization does not destroy the data’s analytical value. For instance, analytically relevant data points such as geography should remain real (diagram 2). Therefore, the state and the ZIP code remain the same, but the street name and number are anonymized. Similarly, the gender of each anonymized record stays the same, but the name is scrambled, keeping the same number of characters, to generate realistic but anonymized data. Once the banks send the realistic data, the analytics algorithm analyses trends and historical data received from these banks to understand which product portfolio will work well in which area.
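The geography side of this can be sketched the same way. Again a hedged illustration, not the banks' actual pipeline: the fake street names and the record fields below are assumptions for the example.

```python
import random

def anonymize_address(record: dict) -> dict:
    """Keep analytically relevant geography (state, ZIP) real, but
    replace the street name and number so the individual address
    cannot be recovered. Street names here are illustrative."""
    fake_streets = ["Maple Ave", "Cedar St", "Birch Rd"]
    return {
        **record,                                  # zip and state pass through
        "street_number": random.randint(1, 999),   # fabricated house number
        "street": random.choice(fake_streets),     # fabricated street name
    }

original = {"street_number": 114, "street": "Bank St",
            "zip": "10014", "state": "NY"}
anon = anonymize_address(original)
print(anon)
```

The anonymized record still maps to the right neighborhood for trend analysis, but the actual address is gone.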
Businesses possess large reserves of data – this data is an asset to the company, but only if it is available for analysis. Responsible use of this data can be achieved with the help of anonymized analytics – an anonymization solution that provides realistic data that meets data rules and validations.
But anonymization for anonymization’s sake wrecks the analytical value of the data, which is why so many projects fail. Data rules and validations also need to be considered.
The concept of responsible anonymization – one that preserves the data’s richness but never compromises an actual individual’s data – is the gold standard. This is what organizations should be aiming for.