Dealing with data privacy – anonymization techniques

Internet companies have the resources and power to collect a microscopic level of detail on each and every one of their users and build detailed user profiles. In this day and age, it’s almost delusional to think that we still operate in a world that sticks by the good old ideals of privacy. You have probably experienced, at some point in your life, a well-targeted email, phone call, letter or advertisement.

Why should we care about privacy?

“If someone had nothing to hide, why should he/she care?” You have probably heard this argument before. Let’s use an analogy that explains why some people do care, despite having “nothing to hide”:

You just came home from a date. You are really excited, and can’t believe how awesome the person you are dating is. In fact, it feels almost too good to be true how this person just ‘gets you’, and it feels like he/she has known you for a very long time. However, as time goes by, the person you are dating starts to change and the romance wears off. Then, while unintentionally glancing at your date’s work desk, you notice a folder stuffed with your personal information: everything from your place of birth to your book membership status, and somehow even your parents’ contact information! You realise this data was used to relate to you on a personal level. The folder doesn’t contain anything that shows you are of bad character, but you still feel betrayed and hurt that the person you are dating disingenuously tried to create feelings of romance.

As data scientists, we don’t want to be the date who lost another person’s trust, but we also don’t want to have zero understanding of the other person. How can we work around this challenge?

Simple techniques to anonymize data

A simple approach to maintaining personal privacy when using data for predictive modelling or to glean insightful information is to scrub the data.

Scrubbing is simply removing personally identifiable information such as name, address and date of birth. However, cross-referencing the scrubbed dataset with public data or other databases you have access to can fill in the gaps. The classic example is Latanya Sweeney, then an MIT student, who identified an individual by cross-referencing scrubbed health records with voter-registration records.
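To make the cross-referencing risk concrete, here is a minimal sketch of scrubbing followed by a linkage attack. All records, field names and the choice of ZIP code and sex as the columns that survive scrubbing are made up for illustration; this is not a reconstruction of the actual study.

```python
# Hypothetical toy records; every name and value is invented for illustration.
health_records = [
    {"name": "Alice Ng", "address": "1 Elm St", "birth_date": "1961-07-14",
     "zip": "02138", "sex": "F", "diagnosis": "asthma"},
    {"name": "Bob Lee", "address": "9 Oak Ave", "birth_date": "1975-02-02",
     "zip": "02139", "sex": "M", "diagnosis": "diabetes"},
]
voter_roll = [
    {"name": "Alice Ng", "zip": "02138", "sex": "F"},
    {"name": "Carl Diaz", "zip": "02140", "sex": "M"},
]

DIRECT_IDENTIFIERS = {"name", "address", "birth_date"}
QUASI_IDENTIFIERS = ("zip", "sex")  # assumed to survive scrubbing


def scrub(records, drop=DIRECT_IDENTIFIERS):
    """Return copies of the records with direct identifiers removed."""
    return [{k: v for k, v in r.items() if k not in drop} for r in records]


def link(scrubbed, public, keys=QUASI_IDENTIFIERS):
    """Re-attach names by matching on the columns left after scrubbing."""
    index = {}
    for person in public:
        key = tuple(person[k] for k in keys)
        index.setdefault(key, []).append(person["name"])
    matches = []
    for record in scrubbed:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], record["diagnosis"]))
    return matches


scrubbed = scrub(health_records)
reidentified = link(scrubbed, voter_roll)
print(reidentified)
```

In this toy data the scrubbed rows no longer carry a name, yet one of them can still be tied back to a person because its remaining columns are unique in the public list.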

Tokenization is another commonly used technique to anonymize sensitive data. It replaces personally identifiable information such as a name with a token, for example a numerical representation of that name. However, anyone who can get hold of the mapping between tokens and original values can use the token as a reference back to the original data.
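As a rough sketch of the idea, the snippet below swaps names for random tokens and keeps the mapping in a small in-memory "vault". The `TokenVault` class and the sample rows are hypothetical and only meant to show why the mapping itself becomes the sensitive asset.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory token vault, for illustration only."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value):
        """Return a stable random token for the given value."""
        if value not in self._token_by_value:
            token = secrets.token_hex(8)
            while token in self._value_by_token:  # regenerate on a collision
                token = secrets.token_hex(8)
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def detokenize(self, token):
        """Look up the original value; only possible with vault access."""
        return self._value_by_token[token]


vault = TokenVault()
rows = [("Alice Ng", 120.0), ("Bob Lee", 75.5), ("Alice Ng", 30.0)]  # made-up data
tokenized = [(vault.tokenize(name), amount) for name, amount in rows]
print(tokenized)
```

The same name always maps to the same token, so analysis per customer still works, but whoever holds the vault can reverse every token.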

Sophisticated techniques to anonymize data

More sophisticated techniques that help guard against the de-anonymization of data are differential privacy and k-anonymity.

Differential privacy uses mathematical mechanisms to add random noise to the original dataset to mask personally identifiable information, while still making it possible to get results that are probabilistically similar to those you would get by running the same query over the original dataset. An analogy is disguising a toy panda with a horse head, creating just enough of a disguise that it can’t be recognised as a panda. A query for the count of toys still includes the disguised panda, but it doesn’t single out an individual panda toy.
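One standard way to add such noise is the Laplace mechanism applied to a counting query, sketched below. It is only one of several differential privacy mechanisms and is not claimed to be what any particular company uses; the toy data, the `epsilon` value and the function names are assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records that satisfy the predicate.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(0)  # seeded only so the demo is repeatable
toys = ["panda"] * 40 + ["horse"] * 60  # hypothetical toy box
noisy = private_count(toys, lambda t: t == "panda", epsilon=0.5, rng=rng)
print(round(noisy, 1))
```

A smaller `epsilon` means more noise and stronger privacy. Averaged over many runs the noisy answer stays close to the true count, but any single answer does not reveal whether one particular toy was in the box.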

Apple, for example, has started using differential privacy with its iOS 10 devices to uncover patterns in user behaviour and activity without having to identify individual users. This allows Apple to analyse data ranging from purchases to web browsing history to health data.

K-anonymity also aggregates data. It requires that each identifiable combination of attributes is shared by at least k people, so that an individual is hidden within that group. Identifiable information such as age can be generalised, so that an exact age is replaced with an approximation such as less than 25 years of age or greater than 50 years of age. However, because it does not use randomisation to mask sensitive data, k-anonymity can still be vulnerable to re-identification attacks.
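The generalisation step and the "at least k people per group" check can be sketched as follows. The age bands, field names and records are assumptions chosen for illustration, not a reference implementation.

```python
from collections import Counter


def generalise_age(age):
    """Replace an exact age with a coarse band (band edges are an assumption)."""
    if age < 25:
        return "<25"
    if age <= 50:
        return "25-50"
    return ">50"


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(size >= k for size in groups.values())


# Hypothetical records, made up for illustration.
raw = [
    {"age": 19, "postcode": "2000", "condition": "flu"},
    {"age": 23, "postcode": "2000", "condition": "asthma"},
    {"age": 31, "postcode": "2000", "condition": "flu"},
    {"age": 47, "postcode": "2000", "condition": "diabetes"},
    {"age": 58, "postcode": "2000", "condition": "flu"},
    {"age": 64, "postcode": "2000", "condition": "asthma"},
]
generalised = [{**row, "age": generalise_age(row["age"])} for row in raw]

print(is_k_anonymous(raw, ["age", "postcode"], k=2))          # False
print(is_k_anonymous(generalised, ["age", "postcode"], k=2))  # True
```

With exact ages every row is unique, so the raw table fails the check. After generalising ages into bands, each band holds two people and the table is 2-anonymous on age and postcode.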

Remember: It’s your privacy, too

As data scientists, it can be easy to disassociate ourselves from data which is not personally our own, but other people’s. It can be easy to forget that the data we hold in our hands are not just endless records, but are the lives of the people who kindly gave up their data so that we could go about understanding the world better. Besides the serious legal consequences of breaching privacy, remember that it could be your personal life records in a stranger’s hands.


Rebecca Merrett

Writes technical blogs and other content for Wargaming Sydney/BigWorld Technology.

Raja Iqbal

Raja is the CEO and Chief Data Scientist at Data Science Dojo. He has worked at Microsoft Bing and Bing Ads in various research and development roles in data science and machine learning.
