As data scientists, we are all in this to pursue the objective truth, or close to it. Check your practices against these bad data ethics examples.
You’ve probably come across this before:
A vendor skews a graph that compares their product with a competitor’s in the market.
A survey conveniently shows that most respondents unanimously agree on an issue.
A cosmetic company claims their new “miracle cream” has been “scientifically tested.”
While these examples may seem silly to some, misleading analysis is a genuine issue that often has profound consequences. Ethical concerns arise when data scientists don’t follow good practices when collecting, handling, presenting, and modeling data.
As an aspiring data science professional, you should not let your personal viewpoint drive your analysis.
Data scientists pursue the objective truth, or something as close to it as possible; this is where data ethics comes in. We want to discover things that improve our understanding of the world and the people around us, and that help us better predict our future.
This is not just a mantra: it’s a way of thinking that every data scientist should adopt to be successful in the role. A subjective personal viewpoint can get in the way of being a good data scientist.
There’s a saying that your model is only as good as your data. It follows that any conclusion you draw about groups of people, or about how the world works, depends on whether ethical data collection practices were followed.
For example, you might come across a model in which “race” is a heavily weighted predictor variable.
There are two issues with this:
First, the model happens to classify people of a certain race as uniformly high credit risk applicants for a home loan at a bank. However, a closer look at the actual data shows that most cases come from one racial group, and that nearly all of those cases live in the same part of the city.
How different would the results be with a more diverse random sample of cases across all locations? What if many members of this racial group living elsewhere, with good credit histories, simply didn’t make it into the dataset?
Also, in classification tasks, if the classes in the dataset are extremely imbalanced, the model will tend to predict the majority class correctly most of the time but will struggle with the under-represented classes.
Second, why did the bank decide to place such a heavy weight on the predictor variable “race”? Are the results different when race is not heavily weighted? Was this decision driven by a personal viewpoint, or was there a non-subjective reason behind placing a heavy weight on race?
It could be that the reason behind this decision is purely subjective and skews the results, making any conclusions drawn from the model unreliable.
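The class-imbalance problem described above is easy to see in a toy sketch. All numbers here are invented for illustration: with a 95/5 split, a model that simply predicts the majority class every time looks accurate while being useless for the minority class.

```python
# Hypothetical illustration only: 95 "low risk" (0) vs 5 "high risk" (1).
labels = [0] * 95 + [1] * 5

# A naive model that always predicts the majority class.
predictions = [0] * 100

# Overall accuracy looks impressive...
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# ...but the model never identifies a single high-risk case.
minority_recall = sum(
    p == 1 for p, y in zip(predictions, labels) if y == 1
) / 5

print(accuracy)         # 0.95
print(minority_recall)  # 0.0
```

This is why accuracy alone is a misleading metric on imbalanced data: per-class recall exposes what the headline number hides.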
Bad data ethics examples
Studies that draw conclusions about crime rates among certain ethnic or socioeconomic groups are another area where data ethics is a concern. Why do some studies use data only from certain cities and not others? Could it be that crime in these carefully selected cities is likely to support a subjective viewpoint and lead to wrong conclusions about a group overall?
Whatever happened to the good old practice of drawing a random sample across the entire population before you even thought about using the data to make conclusions about the whole group?
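The gap between a cherry-picked sample and a random one is easy to demonstrate. In this sketch, the city names and rates are entirely invented; the point is only that averaging over a hand-selected subset can sit far from the population value:

```python
import random

# Hypothetical per-city rates (all values invented for illustration).
rates = {"A": 2.1, "B": 1.8, "C": 9.5, "D": 8.9, "E": 2.3,
         "F": 1.9, "G": 2.0, "H": 2.2}

# A "study" that only looks at the two highest-rate cities.
cherry = ["C", "D"]
cherry_mean = sum(rates[c] for c in cherry) / len(cherry)

# A random sample drawn across all cities.
random.seed(0)
sample = random.sample(list(rates), 4)
sample_mean = sum(rates[c] for c in sample) / len(sample)

# The true population mean, for comparison.
population_mean = sum(rates.values()) / len(rates)

print(cherry_mean, sample_mean, population_mean)
```

With these made-up numbers the cherry-picked mean (9.2) is more than double the population mean (about 3.84), which is exactly how a carefully selected set of cities can “prove” a predetermined conclusion.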
Similarly, deliberately excluding cases from an analysis, without any reason to believe the data is incorrect or inaccurate, is a problem. Wrangling the data to try to prove a viewpoint is another ethical deal breaker.
For example, say you come across a statistical significance test showing that men and women differ significantly in how well they learn mathematics. However, the test is based on every man in the dataset, but on the women only after a few outliers were excluded and some cases were merged into a single averaged case.
This matters because it could lead to incorrectly rejecting the null hypothesis of no real difference, in favor of the bogus claim that one gender is better at math than the other.
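A minimal sketch of how that kind of wrangling shifts a comparison, using invented scores (these numbers are made up and stand in for no real study):

```python
# Invented test scores for illustration only.
men   = [61, 64, 58, 70, 67, 62]
women = [60, 63, 59, 71, 66, 61, 88, 92]   # includes two high scorers

def mean(xs):
    return sum(xs) / len(xs)

# Honest comparison: use everything. The women's mean is higher.
print(mean(men), mean(women))

# "Wrangled" comparison: drop the women's top scores as "outliers"...
trimmed = [x for x in women if x < 80]

# ...and merge three of the remaining cases into one averaged case,
# silently shrinking the sample from six observations to four.
wrangled = [mean(trimmed[:3])] + trimmed[3:]

print(mean(men), mean(trimmed), mean(wrangled))
```

With the full data, the women’s mean (70.0) exceeds the men’s (about 63.7); after the selective trimming, it drops below it (about 63.3). The same significance test run on both versions could reach opposite conclusions, which is exactly the manipulation described above.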
In conclusion, these examples of bad data ethics should be front of mind when collecting, cleaning, wrangling, and modeling data, so that our conclusions are not built on false “truth.”
Finally, think about it this way: how would you feel if someone painted a misleading picture of you based on a subjective viewpoint and tried to label it as “fact”?
Big data brings big responsibilities. Data is changing the way we live our lives and do business. Here are 10 controversial data science experiments.
Data science is changing the game when it comes to manipulating data sets and visualizing big data. Knowing how to conduct a successful data science experiment is critical for businesses if they want to effectively target and understand their customers. With these experiments comes the responsibility of understanding big data ethics.
There is so much data in the world that many people are overwhelmed by the sheer magnitude of it. Most people also have no idea how powerful that information truly is. Yes, data experiments have the potential to improve their lives. But how do companies avoid stepping on toes with their data usage and application?
Data science experiments are typically used not only to answer the questions a business already has, but to help the business formulate those questions in the first place. We’ve compiled a list of some of the most controversial data science experiments that have raised questions about the use (and misuse) of big data.
1. Target’s pregnancy prediction
Let’s first look at one of the most notorious examples of the potential of predictive analytics. It’s well known that every time you go shopping, retailers take note of what you buy and when you buy it. Your shopping habits are tracked and analyzed: what time you shop, whether you use digital or paper coupons, whether you buy brand name or generic, and much more. Your data is stored in internal databases where it’s picked apart in an effort to find trends between your demographics and buying habits (supposedly in an effort to serve your needs better).
Your retailer arguably knows more about your consumption habits than you do, or is certainly trying to. One Minneapolis man learned of his daughter’s pregnancy because Target targeted her with coupons for baby products once her buying habits assigned her a high “pregnancy prediction” score. This caused a social media firestorm: Target did not violate any privacy policies, but was it appropriate to target a private life event this way? The predictive analytics project was successful, but the public thought the targeted marketing was a little too invasive and direct. It remains one of the best-known cases in big data ethics, and of the potential for misuse.
2. Allstate telematics packages
Second, let’s talk insurance. Car insurance premiums can make or break the bank, especially depending on your driving record. For the most part, it’s easy to find a company to insure you (even if your driving record is less than desirable). Within the next decade, expect major changes in how insurance premiums are determined. One of the leading companies driving this change is Allstate.
Allstate’s Drivewise package offers (mostly good) drivers the chance to save money based on their driving habits. The only caveat is that Allstate will install a telematics tracking device in your vehicle to obtain this information. Your braking, speeding, and even call center data can potentially be used to determine your premiums. If you’re a good driver, this might be great news for you, but concerns arise when it comes to GPS tracking. How ethically sound is this use of your driving data? This potentially identifiable information needs to be diligently safeguarded, but the deeper concern is how GPS tracking will affect people from poorer areas.
Car insurance companies can rate roads by how safe they are. If people in poorer areas are surrounded by roads with a lower “safety” rating, and spend 60% of their driving time on them, how much will that push up their premiums? Will a good driving record be enough to save them from outrageous premiums? What other data will be used: tweets and other social media posts? All good questions to consider when looking at big data ethics.
3. OkCupid data scrape
In 2016, almost 70,000 OkCupid profiles had their data released onto the Open Science Framework, an online community where people share raw data and collaborate over data sets. Two Danish researchers, Emil Kirkegaard and Julius Daugbjerg-Bjerrekaer, scraped the data with a bot profile on OkCupid and released personally identifiable information such as age, gender, sexual orientation, and the personal responses to the survey questions the site asks when people sign up for a profile.
Notably, the two researchers didn’t feel their actions were ethically wrong, because the “data is already public.” The release raised eyebrows and forced questions about the ethics of republishing “already public” data. What does big data ethics say about it? What’s off-limits? The main concern was that even though data may be public, that doesn’t mean a person consents to having personally identifiable data published on an online forum. In the public’s eyes, this was not ethically okay.
4. The Wikipedia likelihood of success
Former Google data scientist Seth Stephens-Davidowitz wanted to look into what factors lead people to become successful. He was interested in finding the components of people’s lives that made them successful (or at least prominent enough to have Wikipedia pages). To dig into this, he downloaded over 150,000 Wikipedia pages to build his initial dataset.
His finding was that people who grew up in larger towns near universities were more likely to be successful, and that those towns tended to be diverse: more successful people came out of towns with high immigrant populations and strong support for the arts. For some people, promoting the arts, subsidizing education, and encouraging immigration may not be high-priority items. This example is a little different from the others: it didn’t cause turmoil in the world of big data ethics, but its findings weren’t universally agreed upon.
5. Big data and the credit gap
A big part of the “American Dream” is being able to climb the ladder of success and provide financially for yourself and your loved ones. Your credit report and history affect huge financial decisions in your life; it’s a number that follows you for the rest of your life, and its scope reaches far beyond what interest rates you can get on loans. Most Americans don’t understand everything that goes into their credit score, and according to a Yale Journal of Law and Technology article, “traditional, automated credit scoring tools raise longstanding concerns of accuracy and fairness.” With the advent of big data, alternative ways of credit scoring are emerging, but with their own share of ethical concerns.
The growing mindset that “all data is credit data” attempts to benefit underserved consumers by using algorithms to detect patterns in behavior. Unfortunately, this approach pulls data points from consumers’ behavior both online and offline. The problem is that no one knows exactly how they are being scored, just that any data point is fair game. This poses the risk of receiving an unfair credit score, with little foundation to stand on when disputing inaccurate scoring data.
The lack of transparency makes people wonder how objective credit scoring really is: will I be judged on my social media presence, friends, church attendance, ethnicity, or sex? Chances are, you already are. As for big data ethics, the general public doesn’t like this use of its data. Another concern is the accuracy of the data, which can affect major financial decisions and offers in your life; in some instances, inaccurate data can severely hinder your ability to move forward financially.
6. Big data ethics and AI “Beauty Contest”
In 2016, the first beauty contest judged by AI (artificial intelligence) selected 44 winners from photos submitted over the internet. The selection raised concerns because, of the 6,000 photos submitted from over 100 countries, only a handful of winners were non-white: one person of color was selected, and the rest of the non-white winners were Asian. The obvious problem was that a majority of photo submissions had come from Africa and India.
Beauty.AI, the company that ran this internet beauty contest, described it as a “deep learning” project sponsored in part by Microsoft. Alex Zhavoronkov, Chief Science Officer of Beauty.AI, said the algorithm was biased because the data it was trained on was not diverse enough. For future projects, the hope is to correct the bias by using more diverse data sets and designing algorithms that can explain any bias.
7. Self-driving vehicles
In early 2018, an Uber self-driving vehicle struck and killed an Arizona woman, and social media was instantly up in arms. Self-driving vehicles are purposely designed to avoid accidents like this, and the incident (the first of its kind) raised serious ethical dilemmas about the algorithms being designed for these vehicles.
What is the vehicle’s role if it is about to be involved in a crash? Does it protect the people inside at all costs? Does it avoid the pedestrian at all costs (even if that endangers the passengers)? Does the number of people in the vehicle versus the number of pedestrians about to be hit weigh in? All of these questions need to be answered before self-driving vehicles can fully take part in society.
8. Northpointe’s risk assessment
In the United States, court systems are becoming increasingly reliant on algorithms to determine the likelihood of recidivism among criminal defendants. Northpointe, Inc. has a program called COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) that is used in multiple states to produce a “risk assessment” score.
Put simply, COMPAS scores defendants on their likelihood of reoffending: the higher the score, the higher the assessed risk. In a 2016 analysis of 10,000 criminal defendants in Broward County, Florida, ProPublica found that black defendants were mistakenly given higher scores more often than their white counterparts, while white defendants were often scored lower than they should have been (they turned out to be more “risky” than predicted). Northpointe has denied any racial bias in its algorithm, but the controversy over using a potentially biased algorithm raises concerns. As far as big data ethics goes, this case is widely frowned upon.
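The disparity ProPublica reported is the kind of thing a group-level error audit can surface. A minimal sketch with invented records (these are not ProPublica’s data; the groups, flags, and outcomes below are made up to show the mechanics):

```python
# Toy audit: each record is (group, predicted_high_risk, reoffended).
# All records are invented for illustration.
records = [
    ("a", True,  False), ("a", True,  False), ("a", True,  True),
    ("a", False, False), ("a", False, True),
    ("b", True,  True),  ("b", False, False), ("b", False, False),
    ("b", False, True),  ("b", True,  False),
]

def false_positive_rate(group):
    """Share of people in a group who did NOT reoffend
    but were still flagged as high risk."""
    non_reoffenders = [r for r in records if r[0] == group and not r[2]]
    flagged = [r for r in non_reoffenders if r[1]]
    return len(flagged) / len(non_reoffenders)

for g in ("a", "b"):
    print(g, false_positive_rate(g))
```

In this toy data, group “a” has a false positive rate of 2/3 versus 1/3 for group “b”: the same kind of gap, measured the same way, that made the COMPAS scores controversial even when overall accuracy looked similar across groups.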
9. 23andMe genomics
23andMe is a company that launched in 2006 with the goal of helping people understand their DNA and genetic makeup on a personal level that had never been accessible before. For $100 and a little bit of saliva, people could learn whether they carried one or more of the 100 risk factors 23andMe had identified. According to a 2015 Fast Company article, customers who opt in can consent to having their data shared with drug and insurance corporations, or even academic labs. According to Matthew Herper’s Forbes article, “23andMe can only scan the genome for known variations,” but its recent partner, the biotechnology company Genentech, would like to pay for access to all of the data 23andMe holds (that people have consented to share, of course).
Partnerships with these paying corporations and labs make it possible to mine the data and find patterns in genetic sequences far more cheaply than traditional experiments, but the real cost is privacy. The concern is that pharmaceutical companies, academic labs, and government entities could come to know more about you at a cellular level than you could ever know about yourself. Some feel this is an overreach as far as big data ethics goes; it has the potential for misuse on a massive scale.
10. Microsoft’s Tay bot
In March 2016, Microsoft released a chat bot named “Tay” on Twitter. Tay was meant to talk like a teenager but lasted less than a day, after she started tweeting hateful and racist content. As an artificial intelligence, Tay learned how to communicate with people based on whom she was talking to. After shutting Tay down for her racist comments, Microsoft argued that the tweets were due in part to online “trolls” who deliberately steered Tay into racist conversations.
Since 2016, Microsoft has made adjustments to their AI models, and has released a new “lawyer bot” that can help people with legal advice online. According to a spokeswoman, the problem with Tay had to do with the “content neutral algorithm,” and important questions such as “how can this hurt someone?” need to be asked before deploying these types of AI projects.
As you can see, big data is changing how businesses interact with, reach, and successfully target consumer groups, and big data ethics has to keep pace. While these arguably controversial data science experiments push technology and data insight to the next level, there is still a long way to go. Companies will have to ask themselves questions about the morality of their algorithms, the purpose of their machine learning, and whether their experiments are ethically sound.