

Data Science Dojo | Rebecca Merrett | November 22

There’s more to data security and access control than granting teams within a company different access levels and issuing user passwords.

As data scientists, it is not our job to run our organization’s entire security operation. However, because we work so closely with data, we must understand the importance of having robust mechanisms in place to keep sensitive and personally identifiable information out of the wrong hands and protected from cyber attacks.

Strong passwords? Not enough

Setting ourselves up with a strong password might not cut it in today’s world. Some of the world’s biggest banks, with armies of highly skilled security professionals, have suffered ever more sophisticated cyber attacks. Increasingly, users log into work systems and databases through biometrics, such as fingerprint scanners on smartphones, laptops, and other devices.

Two-factor authentication is another popular data-security mechanism, going beyond identifying and authenticating a user through a password alone. Users now log into systems using a one-time password sent to their work email, in combination with a second factor such as a fingerprint. Generating a random number or token string for each login reduces the risk of a single password being decrypted or obtained some other way.
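
The one-time-password idea can be sketched with the standard HOTP construction (RFC 4226): a server and a user’s device share a secret and a counter, and each derives the same short-lived code independently. A minimal version using only Python’s standard library:

```python
import hashlib
import hmac
import struct

def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """Derive a one-time password from a shared secret and a counter (RFC 4226)."""
    # Pack the counter as an 8-byte big-endian message and HMAC-SHA1 it.
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low nibble of the last byte selects a 4-byte window.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# RFC 4226 test vector: this secret at counter 0 yields "755224".
print(hotp(b"12345678901234567890", 0))  # → 755224
```

Time-based variants (TOTP) simply derive the counter from the current clock, which is why the codes on authenticator apps expire every 30 seconds.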

Finishing the equation

User identity and authentication are only half of the equation, however. The other half is using anomaly detection algorithms or machine learning to pick up on unusual user activity and behavior once a user has logged on. This is something we as data scientists can bring to the table in helping our organizations better secure our customer or business data. Some of the key features of anomaly detection models include the time of access, location of access, type of activity or use of the data, device type, and how frequently a user accesses the database.

The model records these features every time a user logs into the database and continuously calculates a risk score based on how far the current session deviates from the user’s past logins. If the score crosses a high enough threshold, an automated alert can be sent to the security team to investigate further or take action.
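
A toy version of such a risk score might look like the sketch below. The features, weights, and thresholds here are all illustrative assumptions; a real system would use a trained anomaly-detection model rather than hand-set rules.

```python
from dataclasses import dataclass

@dataclass
class LoginEvent:
    hour: int            # hour of day, 0-23
    country: str
    device: str
    rows_accessed: int

def risk_score(event: LoginEvent, history: list[LoginEvent]) -> float:
    """Toy risk score: add weight for each feature that deviates from past logins.
    (Illustrative only -- weights here are arbitrary assumptions.)"""
    score = 0.0
    if event.hour not in {e.hour for e in history}:
        score += 1.0                      # unusual time of access
    if event.country not in {e.country for e in history}:
        score += 2.0                      # unusual location
    if event.device not in {e.device for e in history}:
        score += 1.0                      # unfamiliar device
    avg_rows = sum(e.rows_accessed for e in history) / len(history)
    if event.rows_accessed > 10 * avg_rows or event.rows_accessed < avg_rows / 10:
        score += 1.5                      # unusual volume of data touched
    return score

history = [LoginEvent(10, "US", "laptop-1", 5000),
           LoginEvent(14, "US", "laptop-1", 8000)]
suspect = LoginEvent(3, "DE", "unknown", 1)
print(risk_score(suspect, history))  # → 5.5, high enough to alert the team
```

A score of zero means the session looks like every past session; each deviating feature pushes the score toward the alert threshold.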

Data security examples

Some obvious examples include a user in Boston who logged out of the database 10 minutes ago but is now accessing it from Berlin, or a user who normally logs in during work hours and is now logging in at 3 am.

Other examples include an executive assistant who rarely logs into the database but is now logging in every 10 minutes, or a data scientist who usually aggregates thousands of rows of data and is now retrieving a single row.

A marketer who usually searches the database for contact numbers is now attempting to access credit card information, even though that marketer already knows he or she does not have access to it.

Another way data scientists can safeguard customer or business data is to keep the data inside the database rather than exporting a subset or local copy onto their own computer or device. Nowadays, there are many tools to connect different database providers to R or Python, such as the odbcConnect() function in the RODBC package for R, which reads and queries data from a database using an ID and password rather than importing data from a local file.

The ID and password can be removed from the R or Python file once the user has finished working with the data, so an attacker cannot run the script to get the data without a login. Also, if an attacker were to crack open a user’s laptop, he or she would not find a local copy of the data on that device.
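
The same pattern in Python might look like the sketch below. The DB_USER and DB_PASSWORD environment-variable names are hypothetical, and sqlite3 stands in for a remote database server so the sketch is runnable; against a real server you would pass the credentials to a driver such as pyodbc instead.

```python
import os
import sqlite3

# Credentials are read from the environment at run time, never hard-coded,
# so the script alone is useless to an attacker who copies the file.
# (DB_USER / DB_PASSWORD are hypothetical variable names.)
user = os.environ.get("DB_USER", "")
password = os.environ.get("DB_PASSWORD", "")

# sqlite3 stands in for the remote database here to keep the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("east", 50.0), ("west", 75.0)])

# Aggregate inside the database: only the summary leaves the server, and no
# raw copy of the data ever lands on the analyst's laptop.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # → [('east', 150.0), ('west', 75.0)]
conn.close()
```

Pushing the aggregation into SQL means the sensitive row-level data never leaves the database, which is exactly the property the paragraph above describes.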

Row- and column-level access is another example of data security through fine-grained access controls. This mechanism masks certain columns or rows for different users; the masked columns or rows in tabled data usually contain sensitive or personally identifiable information. For example, columns containing financial information might be masked for the data science team but visible to the finance/payments-processing team.
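
In practice this masking is usually enforced by the database itself, but the idea can be sketched in a few lines. The role names and column lists below are hypothetical:

```python
# Which columns each role may NOT see (hypothetical roles and columns).
MASKED_COLUMNS = {
    "data_science": {"card_number", "ssn"},
    "finance": set(),          # finance team sees everything
}

def mask_row(row: dict, role: str) -> dict:
    """Replace columns this role may not see with a masked placeholder."""
    hidden = MASKED_COLUMNS.get(role, set())
    return {col: ("***" if col in hidden else val) for col, val in row.items()}

row = {"customer": "A. Jones", "card_number": "4111-1111-1111-1111", "spend": 120.5}
print(mask_row(row, "data_science"))
# → {'customer': 'A. Jones', 'card_number': '***', 'spend': 120.5}
```

The data science team can still aggregate spend per customer; the card number simply never reaches them.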

Conclusion & other tips

Other ways to safely deal with sensitive and personally identifiable information include differential privacy and k-anonymity. To learn about these techniques, please read Dealing with data privacy – anonymization techniques.

Data Science Dojo | Raja Iqbal | November 6

What do you think about your privacy? Do you wonder why data privacy and data anonymization are important? Read on to find the answers.

Internet companies have the resources and power to collect a microscopic level of detail on each and every one of their users and to build user profiles. In this day and age, it’s almost delusional to think that we still operate in a world that sticks by the good old ideals of data privacy.

You have experienced, at some point in your life, a well-targeted email, phone call, letter, or advertisement.

Why should we care?

“If someone has nothing to hide, why should they care?” You have probably heard this argument before. Let’s use an analogy that explains why some people *do* care about privacy, despite having “nothing to hide”:

You just came home from a date. You are excited and can’t believe how awesome the person you are dating is. In fact, it feels too good to be true how this person just “gets you,” and it feels like he/she has known you for an exceptionally long time. However, as time goes by, the person you are dating starts to change and the romance wears off.

You notice, from glancing at your date’s work desk, a folder stuffed with your personal information: your place of birth, your book club membership, and somehow even your parents’ contact information. You realize this data was used to relate to you on a personal level.

The folder doesn’t contain anything that shows you are of bad character, but you still feel betrayed and hurt that the person you are dating disingenuously tried to create feelings of romance. As data scientists, we don’t want to be the date who lost another person’s trust, but we also don’t want to have zero understanding of the other person. How can we work around this challenge?


Simple techniques to anonymize data

A simple approach to maintaining personal data privacy when using data for predictive modeling or to glean insightful information is to scrub the data.

Scrubbing simply means removing personally identifiable information such as name, address, and date of birth. However, cross-referencing the scrubbed dataset with public data, or with other databases an attacker may have access to, can fill in the “missing gaps.”
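
Scrubbing itself can be sketched in a few lines (the field names below are hypothetical). Note that quasi-identifiers such as ZIP code survive the scrub, and it is exactly these leftovers that make cross-referencing possible:

```python
# Columns treated as directly identifying (an assumed, illustrative list).
PII_COLUMNS = {"name", "address", "date_of_birth"}

def scrub(record: dict) -> dict:
    """Drop directly identifying columns from a record."""
    return {k: v for k, v in record.items() if k not in PII_COLUMNS}

patient = {"name": "J. Doe", "address": "1 Main St",
           "date_of_birth": "1970-01-01", "zip": "02138", "diagnosis": "flu"}
print(scrub(patient))  # → {'zip': '02138', 'diagnosis': 'flu'}
```

The scrubbed record looks anonymous, yet ZIP code plus a few other fields can often pin down a single person when joined against another dataset.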

The classic example is when then-MIT graduate student Latanya Sweeney identified an individual by cross-referencing scrubbed health records with voter-registration records.

Tokenization is another commonly used technique: personally identifiable information such as a name is replaced with a token, for example a numerical representation of that name. However, the token still acts as a reference to the original data, so anyone who obtains the token mapping can reverse it.
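
A minimal tokenization sketch, assuming random tokens and an in-memory lookup table (a real deployment would keep the mapping in a separately secured vault):

```python
import secrets

# token -> original value; this mapping is the crown jewel and must be protected.
_token_map: dict[str, str] = {}

def tokenize(value: str) -> str:
    """Replace a sensitive value with a random token, keeping a lookup table."""
    token = secrets.token_hex(8)
    _token_map[token] = value
    return token

def detokenize(token: str) -> str:
    """The token is only a reference: anyone holding the map can reverse it."""
    return _token_map[token]

t = tokenize("Alice Smith")
print(t)  # a random 16-hex-digit token carrying no meaning by itself
```

Because the token is random rather than derived from the value, it leaks nothing on its own; the weakness is entirely in how well the mapping table is guarded.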

Sophisticated techniques to anonymize data

More sophisticated workarounds that help overcome the de-anonymization of data are differential privacy and k-anonymity.


Differential privacy

Differential privacy uses mathematical mechanisms to add random noise to the original dataset, masking personally identifiable information while still making it possible to return probabilistically similar results to those you would get by running the same query over the original dataset. An analogy is disguising a toy panda with a horse head: just enough of a disguise that you no longer recognize it as a panda.

When queried, the dataset returns counts for the group of toys the disguised panda belongs to, without identifying any individual toy.
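
The standard way to add that noise to a count query is the Laplace mechanism (a textbook construction, not specific to any one vendor). A runnable sketch:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise by inverse-transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def private_count(true_count: int, epsilon: float = 1.0) -> float:
    """Return a count with Laplace(1/epsilon) noise added.
    A counting query has sensitivity 1 (one person changes it by at most 1),
    so noise with scale 1/epsilon gives epsilon-differential privacy."""
    return true_count + laplace_noise(1.0 / epsilon)

random.seed(42)  # seeded only so the sketch is reproducible
print(private_count(1000))  # close to 1000, but never exactly the true count
```

Smaller epsilon means more noise and stronger privacy; the analyst still sees roughly the right count, but no single person’s presence can be inferred from it.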

Apple, for example, started using differential privacy in iOS 10 to uncover patterns in user behavior and activity without having to identify individual users. This allows Apple to analyze purchases, web browsing history, and health data while maintaining users’ privacy.


K-anonymity

K-anonymity also takes an aggregation approach: it looks for a specified number k of people who share the same combination of identifying attributes, so that each individual is hidden within that group. Identifying information such as age can be generalized, replacing an exact age with an approximation such as less than 25 years or greater than 50 years.

However, the lack of randomization used to mask sensitive data means k-anonymity can be vulnerable to re-identification attacks.
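The generalize-then-check idea can be sketched as follows; the age buckets and quasi-identifier names are illustrative assumptions:

```python
from collections import Counter

def generalize_age(age: int) -> str:
    """Replace an exact age with a coarse bucket."""
    if age < 25:
        return "<25"
    if age <= 50:
        return "25-50"
    return ">50"

def is_k_anonymous(records: list[dict], quasi_ids: list[str], k: int) -> bool:
    """Every combination of quasi-identifier values must occur at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return all(count >= k for count in groups.values())

# Five people with generalized ages and a truncated ZIP code.
people = [{"age_band": generalize_age(a), "zip": "021**"}
          for a in (19, 23, 31, 44, 47)]
print(is_k_anonymous(people, ["age_band", "zip"], k=2))  # → True
```

If any group falls below k, the usual fix is to generalize further (coarser buckets, shorter ZIP prefixes) until every individual hides in a crowd of at least k.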

Remember: It’s your data privacy, too

As data scientists, it can be easy to disassociate ourselves from data that is not personally our own. It can be easy to forget that the data we hold in our hands is not just endless records but the lives of people who kindly gave up some of their privacy so that we could go about understanding the world better.

Besides the serious legal consequences of breaching data privacy, remember that it could be your personal life records in a stranger’s hands.
