There’s more to data security and access control than granting teams within a company different access levels and issuing user passwords.
As data scientists, our job is not to run our organization’s entire security operation. However, because we work so closely with data, we must understand why good, robust mechanisms are needed to keep sensitive and personally identifiable information out of the wrong hands and safe from cyber attacks. That is the role of data security.
Strong passwords? Not enough
Setting ourselves up with a strong password might not cut it in today’s world. Even some of the world’s biggest banks, which employ armies of highly skilled security professionals, have suffered increasingly sophisticated cyber attacks. Today, users log into work systems and databases using biometrics, such as fingerprint scanners on smartphones, laptops, and other devices.
Two-factor authentication is another popular data security mechanism, which goes beyond identifying and authenticating a user by password alone. For example, users can log into a system with their fingerprint in combination with a one-time password sent to their work email, which itself requires a separate login. Generating a new random number or token string each time a user logs in reduces the risk posed by a single password being cracked or obtained some other way.
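As a minimal sketch of the one-time password idea, the Python below issues a fresh random code for each login and checks it before it expires. The function names, the six-digit length, and the five-minute validity window are assumptions made for illustration, not details of any particular product.

```python
import hmac
import secrets
import time

CODE_TTL_SECONDS = 300  # assumed validity window of five minutes


def issue_code(now=None):
    """Return (code, expires_at) with a fresh random six-digit code."""
    now = time.time() if now is None else now
    # secrets draws from the operating system's cryptographic random source
    code = f"{secrets.randbelow(10**6):06d}"
    return code, now + CODE_TTL_SECONDS


def verify_code(submitted, issued, expires_at, now=None):
    """Accept the submitted code only if it matches and has not expired."""
    now = time.time() if now is None else now
    if now > expires_at:
        return False
    # Constant-time comparison avoids leaking how many digits matched
    return hmac.compare_digest(submitted, issued)
```

A real system would also deliver the code over a second channel and invalidate it after one use; those steps are left out here.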
Finishing the equation
User identity and authentication are only half of the equation, however. The other half is using anomaly detection algorithms or machine learning to pick up on unusual user activity and behavior once a user has logged on. This is something we as data scientists can bring to the table to help our organizations better secure customer and business data. Key input features for anomaly detection models include the time of access, the location of access, the type of activity or use of the data, the device type, and how frequently a user accesses the database.
The model collects these data points every time a user logs into the database, and continuously calculates a risk score based on how much they deviate from the user’s past logins. If the user’s score crosses a set threshold, an automated mobile alert can be sent to the security team to investigate further or take action.
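To make the risk score concrete, here is a minimal sketch that compares one login with the same user’s history. The field names (hour, city, device), the penalty of 2.0 for an unseen value, and the alert threshold of 3.0 are all assumptions chosen for illustration; a production model would be trained and tuned on real access logs.

```python
from statistics import mean, pstdev

ALERT_THRESHOLD = 3.0   # assumed cut-off for alerting the security team
UNSEEN_PENALTY = 2.0    # assumed penalty for a city or device never seen before


def risk_score(history, login):
    """Score one login by how far it deviates from the user's past logins."""
    hours = [past["hour"] for past in history]
    spread = pstdev(hours) or 1.0  # fall back to 1.0 to avoid dividing by zero
    # Deviation of the login hour, in standard deviations. The hour is
    # treated as a plain number, so wraparound at midnight is ignored.
    score = abs(login["hour"] - mean(hours)) / spread
    for field in ("city", "device"):
        if login[field] not in {past[field] for past in history}:
            score += UNSEEN_PENALTY
    return score


def should_alert(history, login):
    """True when the score is high enough to notify the security team."""
    return risk_score(history, login) >= ALERT_THRESHOLD
```

With a history of weekday-morning logins from one laptop, a login at 3 a.m. from an unfamiliar device would score well above the threshold, while a login at the usual hour would not.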
Data security examples
Some obvious examples of suspicious activity include a user who lives in Boston and logged out of the database 10 minutes ago but is now accessing it from Berlin, or a user who usually logs in during work hours but is now logging in at 3 a.m.
Other examples include an executive assistant who rarely logs into the database but is now logging in every 10 minutes, or a data scientist who usually aggregates thousands of rows of data but is now retrieving a single row.
Another is a marketer who usually searches the database for contact numbers but is now attempting to access credit card information, even though they already know they do not have access to it.
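The Boston-to-Berlin case can be checked with a simple rule: if the distance between two consecutive access locations could not be covered in the elapsed time, flag it. The sketch below uses the haversine great-circle distance; the 900 km/h speed limit (roughly airliner speed) and the tuple layout are assumptions for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
MAX_SPEED_KMH = 900.0  # assumed upper bound on plausible travel speed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def is_impossible_travel(prev, curr):
    """prev and curr are (latitude, longitude, unix_seconds) tuples."""
    distance = haversine_km(prev[0], prev[1], curr[0], curr[1])
    hours = max(curr[2] - prev[2], 0) / 3600
    # Flag the access when even the assumed top speed cannot cover the gap
    return distance > MAX_SPEED_KMH * hours


boston = (42.36, -71.06, 0)
berlin_10_min_later = (52.52, 13.40, 600)
flagged = is_impossible_travel(boston, berlin_10_min_later)
```

Location derived from IP addresses is approximate, and VPNs can make a legitimate user appear to jump, so a rule like this is better used as one input to the risk score than as a verdict on its own.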
Another way data scientists can safeguard customer or business data is to keep the data inside the database rather than exporting a subset or local copy onto their computer or device. Nowadays, there are many tools for connecting different database providers to R or Python. One example is the odbcConnect() function in the RODBC package for R, which connects with an ID and password so that data can be read and queried directly from the database rather than imported from a local file.
The ID and password can be removed from the R or Python file once the user has finished working with the data, so an attacker cannot run the script to get the data without a login. Also, if an attacker were to break into a user’s laptop, they would not find a local copy of the data on that device.
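A variation on removing the credentials afterwards is never writing them into the script in the first place. The sketch below reads them from environment variables and assembles an ODBC-style connection string. The variable names DB_DSN, DB_USER, and DB_PASSWORD and the helper function are assumptions for illustration, and the commented-out connection step assumes the third-party pyodbc package and a configured data source.

```python
import os

REQUIRED_VARS = ("DB_DSN", "DB_USER", "DB_PASSWORD")  # assumed variable names


def build_connection_string(env=None):
    """Assemble an ODBC connection string without hard-coding credentials."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return f"DSN={env['DB_DSN']};UID={env['DB_USER']};PWD={env['DB_PASSWORD']}"


# Usage, assuming pyodbc is installed and the data source exists:
#   import pyodbc
#   connection = pyodbc.connect(build_connection_string())
#   # Aggregate inside the database instead of exporting raw rows
```

Because the script holds no ID or password, a copy of the file on its own gives an attacker nothing to log in with.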
Row and column access is another example of data security through fine-grained access controls. This mechanism masks certain columns or rows for different users. In tabular data, the masked columns or rows usually contain sensitive or personally identifiable information. For example, the columns that contain financial information might be masked for the data science team but not for the finance or payments processing team.
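Column masking can be pictured with a small sketch. The team names, column names, and mask string below are assumptions for illustration. In practice this control is enforced inside the database itself, not in client code, so the sketch only shows the effect a user would see.

```python
SENSITIVE_COLUMNS = {"card_number"}   # assumed sensitive column
UNMASKED_TEAMS = {"finance"}          # assumed teams allowed to see it
MASK = "****"


def apply_column_mask(rows, team):
    """Return copies of the rows with sensitive columns hidden from the team."""
    if team in UNMASKED_TEAMS:
        return [dict(row) for row in rows]
    # Any team not explicitly cleared sees the mask instead of the value
    return [
        {col: (MASK if col in SENSITIVE_COLUMNS else val) for col, val in row.items()}
        for row in rows
    ]


rows = [{"customer": "A. Smith", "card_number": "4111111111111111"}]
for_data_science = apply_column_mask(rows, "data_science")
for_finance = apply_column_mask(rows, "finance")
```

Masking by default for any team that is not explicitly cleared means a newly added team cannot see the sensitive columns by accident.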
Conclusion & other tips
Other ways to safely deal with sensitive and personally identifiable information include differential privacy and k-anonymity. To learn about these techniques, please read “Dealing with data privacy – anonymization techniques.”