Ethics in research and A/B testing is essential. A/B testing might not be as simple and harmless as it looks. Learn how to take care of ethical concerns in A/B tests.
The ethical way to A/B testing
We have come a long way since the days of horrific human experiments during the World Wars, the Stanford prison experiment, the Guatemalan STD Study, and many more where inhumane treatment was inflicted in the name of science.
As data scientists, we are always experimenting, not only with our models or formulas but also with the responses from our customers. A/B tests or randomized experiments may require human subjects who are willing to undergo a trial or treatment, such as seeing certain content when using a web app or following a certain exercise regime.
Facebook example
What may initially seem like a harmless experiment might cause harm or distress. For example, Facebook’s experiment of provoking negative emotions in some users and positive emotions in others could have had grave consequences. If a user who was experiencing emotional distress happened to see content that provoked negative feelings, it could spur a tragic event such as physical harm.
A careful understanding of our experiments and our test subjects, gained before we implement our research, products, or services, may prevent inappropriate testing. Consent is the best tool to assist data scientists working with data generated by people. As in the guidelines for clinical trials, it is informed consent specifically that is needed to avoid potential unintended consequences of experiments.
If an organization specializing in exercise science accepted participation from a person who has a high risk of heart failure and did not ask for a medical examination before experimenting, then the organization is potentially liable for the consequences.
Often a simple, harmless A/B test might not be as simple and harmless as it looks. So how do we ensure we are not putting our human subjects’ well-being and safety in danger when we conduct our research and experiments?
First steps in research
The first port of call is using informed user consent. This doesn’t mean pages and pages of legal jargon on sign-up or being vague in an email when reaching out for volunteers for your study. This could rather be a popup window or email that is clear on the purpose of the experiment and any warnings or potential risks the person needs to be aware of.
Depending on how intense the treatment is, a medical or psychological examination is a good idea to ensure that the participant can cope with the given treatment. Being unaware of people’s vulnerabilities can lead to unintended consequences. This can be avoided through clearer warnings or the next level up which may be online assessments or even expert examinations.
The next step in ensuring your A/B test or experiment runs smoothly and ethically is making sure you understand local and federal regulations around conducting research experiments on humans. In the US, these regulations mainly look at:
Informed consent, with a full explanation of any potential risks to the subject.
Providing additional safeguards for vulnerable populations such as children, mentally disabled people, mentally ill people, economically disadvantaged people, pregnant women, and so on.
Government-funded experiments need the approval of an Institutional Review Board or an independent ethics committee before conducting experiments.
During the A/B test or experiment, it’s also a good idea to regularly check in and see how your subjects are responding to the treatments, not only for scientific research but also to quickly solve any health or well-being issues.
This could be in the form of a short popup survey or email to check if the user is safe and well, or face-to-face consulting. Also, having an opt-out option allows the subject to take control if they feel their health or well-being is at risk. Having some people opt out might seem inconvenient for your study, but a serious or tragic incident as a result of a participant having to go through the full course of the treatment is a far worse outcome.
Observational studies might be a good alternative if the above steps are in no way feasible for your experiment. Observational studies are limited when making conclusions, and only real experiments allow you to make confident conclusions from the data. However, in some situations, it is neither possible nor ethical to force treatments onto subjects.
For example, it’s not ethical to inject cancer cells into random subjects, but you can study cancer patients with the inherited attributes you are looking for to help with your research.
The ethical takeaway
It is understood that there can be some overhead in carefully preparing, setting up, and following ethical guidelines for an experiment or A/B test. However, the serious consequences of not doing it properly, as well as public distrust, will only lead to a reluctance to share data, hindering our ability to effectively do our work.
Just like humans, algorithms can develop bias and make skewed decisions. What are these biases and how do they impact decision-making?
An algorithmic bias in making
If we took a hard look at every model ever built for classifying who is the optimal candidate for:
A credit loan
A job promotion
A free scholarship or
Any other opportunity,
would we see a pattern in certain groups of people being granted these opportunities over others? Are our algorithms and formulas biased?
Understanding the problem
Would we see these models repeatedly make decisions about who should be part of the “have” and “have not” groups? Further, do these models truly pick the optimal candidate? Instead, might they pick according to what someone personally thinks is the optimal candidate?
Research groups like AI Now recently launched an initiative to fight algorithmic bias, which is bringing these issues to light. It’s crucial that we as data scientists keep our algorithms in check, so that we avoid developing yet another tool that is used to discriminate against people.
So how can we keep our algorithms in check?
In recent years, researchers have come up with ways to detect if a model is biased in its decisions about people. A 2016 paper called Equality of Opportunity in Supervised Learning proposes a framework.
This framework uses “equalized odds and equal opportunity” as a criterion for assessing a model’s fairness when classifying people. This criterion allows features to predict an outcome or class (such as predicting a “high credit risk applicant”).
It prohibits abusing a particular attribute of a person (such as race) to do this. The model must be equally accurate in all demographics. Consequently, it is punished if it only performs well for the majority of people. This means that the predicted outcome must have equal true positive/negative rates and false positive/negative rates across all demographics.
The framework is applied as a post-learning step, so it doesn’t require modifying the algorithm or model itself. It assesses whether the results from a model seem skewed towards a group of people. For example, a flawed model might make it harder for African Americans who do pay back their loans to be approved for loans,
while making it easier for Caucasians who don’t pay back their loans to be approved. The framework ensures that this kind of model would be judged unfair, as it would not result in equal false positive/negative rates for both African Americans and Caucasians.
The framework also overcomes the loss of utility that comes with using demographic parity, which requires the predicted outcome to be independent of a particular sensitive attribute.
Using the framework, the predicted outcome is allowed to depend on a particular attribute, but only through the actual outcome. This prevents the attribute from being used as a proxy for the actual outcome while avoiding loss of utility.
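To make the equalized odds criterion more concrete, here is a minimal sketch of a post-learning check. It is not the paper’s code: the function names (group_rates, equalized_odds_gap) and the toy loan data are assumptions made for illustration. It compares true positive and false positive rates across two hypothetical groups, where 1 means “repaid the loan” for the actual outcome and “approved” for the prediction.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Return {group: (true_positive_rate, false_positive_rate)}."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for actual, predicted, group in zip(y_true, y_pred, groups):
        if actual == 1:
            key = "tp" if predicted == 1 else "fn"
        else:
            key = "fp" if predicted == 1 else "tn"
        counts[group][key] += 1

    rates = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        tpr = c["tp"] / positives if positives else float("nan")
        fpr = c["fp"] / negatives if negatives else float("nan")
        rates[group] = (tpr, fpr)
    return rates


def equalized_odds_gap(rates):
    """Largest difference in TPR and in FPR between any two groups."""
    tprs = [tpr for tpr, _ in rates.values()]
    fprs = [fpr for _, fpr in rates.values()]
    return max(tprs) - min(tprs), max(fprs) - min(fprs)


# Hypothetical toy data: both groups have the same actual repayment pattern,
# but the model approves them very differently.
y_true = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
groups = ["A"] * 5 + ["B"] * 5

rates = group_rates(y_true, y_pred, groups)
tpr_gap, fpr_gap = equalized_odds_gap(rates)
print(rates)
print(tpr_gap, fpr_gap)
```

In this toy data, group A’s reliable payers are always approved while only a third of group B’s are, so the gaps are far from zero and the model would fail the check. How small a gap counts as acceptable is a judgment call the sketch does not make.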
Predictor variables and skewed data
Another framework for detecting algorithmic bias is testing how different predictor variables or attributes might skew the predicted outcome. A 2017 paper called Counterfactual Fairness shows how different variables influenced the results of the 2014 stop-and-frisk New York City police initiative.
The data showed that the police officers mostly stopped and frisked African Americans and Hispanics. This happened despite most of those people being innocent or not as suspicious as predicted.
Yet actual incidents of crime were similar across all races. When all predictor variables were considered, including the race attribute, the model learned to correlate race with the criminality outcome. By using only variables related to a person’s criminality, the researchers were able to spot criminals more accurately than if they had built a model highly dependent on race and appearance.
The Takeaway
First, this research shows that relying on race as a predictor leads to a skewed outcome. Second, it also shows how ineffective the police would be by allowing such bias to be at the core of their decisions.
Visualizations of the predicted versus the actual data show how some locations with a high number of arrests could be completely missed if they were to depend on race. How we construct our models and the variables we use can truly affect people’s opportunities, livelihood, and overall well-being. Therefore, this must be handled ethically and responsibly.
As data scientists, our philosophy should be built on the pursuit of truth, not the manipulation of models to find the most convenient or profitable results at all costs, even at the cost of our ethics.
We must include bias assessments as part of the process, so we can be more confident that our models are designed to better our understanding of people and make smarter decisions, not dumb and discriminatory decisions.
Maybe your boss isn’t Bill Lumbergh, but if his understanding of analytics is limited to green and red, chances are his ad hoc requests have made you roll your eyes more than once.
Hello Peter, what’s happening? Ummm, I’m going to need you to go ahead and come in tomorrow to build that report. So, if you could be here around 9 that would be great, mmmk… oh oh! and I almost forgot ahh, I’m also going to need you to go ahead and come in on Sunday too, kay? We ahh don’t understand why our sales dropped this week and ah, we need to play catch up and analyze it.
Honestly, how many times has this happened to you? Maybe your boss isn’t a Bill Lumbergh, but if his understanding of analytics is limited to green = good and red = bad, chances are you’ve rolled your eyes in disgust more than once.
It happens a lot with Ad hoc analysis
And you’re not alone. Ad hoc analytics requests can make up 50% of an analytics team’s time. So what is a pragmatic analyst to do? According to Phil Kemelor, one strategy is to adopt a “don’t say yes” approach. Before committing to a request, ask yourself:
• What is the business reason behind the request?
• Will this kind of analysis answer the business question?
• Do we know how long this will take?
• Can we fit this into our funnel?
In other words, having an intake process to prioritize analytics requests can save teams a lot of weekend work.
When deciding whether to commit or push back on a request, it’s also important to remember that your efforts will be wasted if the analysis cannot be acted upon. In other words, what will the business unit be able to do when they get access to your analysis?
Will the organization be able to make changes to improve a bad situation? Nothing is more frustrating than spending a weekend building a report that subsequently gets printed out and put on a shelf to collect dust.
1. Knowing the difference between driving the car and fixing the engine
Brent Dykes also emphasizes the importance of understanding the difference between reporting and analysis. Reporting is the process of organizing data to monitor performance; analysis is the process of exploring data and reports to extract insights.
The former helps a company ensure that everything is running well; the latter is an investigative tool used to figure out what’s going on “underneath the hood.” Organizations that don’t understand the difference between the two are more likely to make ad hoc requests.
2. Doesn’t self-serve solve this problem?
What about self-service tools? After all, if any employee has the potential to become a citizen data scientist, then the demand for ad hoc requests should drop, right? Perhaps, but the costs to the organization might outweigh the benefits. Literally.
The ad hoc reporting promise fails when ad hoc reports:
• Are treated like official reports shared broadly across the organization
• Perform shallow analysis that lacks real insight
• Are subject to the author’s own confirmation bias
3. Take a stand, for the right reason
Not everyone has the luxury of saying no to their Bill Lumbergh. But as a recognized data expert in your organization, you do have something much more powerful: credibility.
In the long run, this means that you have the ability to shape your company’s data strategy, and ultimately wean the business off random ad hoc analysis. Start flexing those muscles today.
Statistical distributions help us understand a problem better by assigning a range of possible values to the variables, making them very useful in data science and machine learning. Here are 7 types of distributions with intuitive examples that often occur in real-life data.
Whether you’re guessing if it’s going to rain tomorrow, betting on a sports team to win an away match, framing a policy for an insurance company, or simply trying your luck on blackjack at the casino, probability and distributions come into action in all aspects of life to determine the likelihood of events.
Having a sound statistical background can be incredibly beneficial in the daily life of a data scientist. Probability is one of the main building blocks of data science and machine learning. While the concept of probability gives us mathematical calculations, statistical distributions help us visualize what’s happening underneath.
Having a good grip on statistical distribution makes exploring a new dataset and finding patterns within a lot easier. It helps us choose the appropriate machine-learning model to fit our data and speed up the overall process.
In this blog, we will be going over diverse types of data, the common distributions for each of them, and compelling examples of where they are applied in real life.
Common Types of Data
Explaining various distributions becomes more manageable if we are familiar with the type of data they use. We encounter two different outcomes in day-to-day experiments: finite and infinite outcomes.
When you roll a die or pick a card from a deck, you have a limited number of outcomes possible. This type of data is called Discrete Data, which can only take a specified number of values. For example, in rolling a die, the specified values are 1, 2, 3, 4, 5, and 6.
Similarly, we can see examples of infinite outcomes in our daily environment. Recording time or measuring a person’s height has infinitely many values within a given interval. This type of data is called Continuous Data, which can have any value within a given range. That range can be finite or infinite.
For example, suppose you measure a watermelon’s weight. Depending on the precision of the scale, it could be recorded as 10.2 kg, 10.24 kg, or 10.243 kg. This makes it measurable but not countable; hence, it is continuous. On the other hand, suppose you count the number of boys in a class; since the value is countable, it is discrete.
Types of Statistical Distributions
Depending on the type of data we use, we have grouped distributions into two categories, discrete distributions for discrete data (finite outcomes) and continuous distributions for continuous data (infinite outcomes).
Discrete Uniform Distribution: All Outcomes are Equally Likely
In statistics, uniform distribution refers to a statistical distribution in which all outcomes are equally likely. Consider rolling a six-sided die. Each of the six numbers, 1, 2, 3, 4, 5, and 6, is equally likely to come up on your next roll, with a probability of 1/6, which makes this an example of a discrete uniform distribution.
As a result, the uniform distribution graph contains bars of equal height representing each outcome. In our example, the height is a probability of 1/6 (0.166667).
Uniform distribution is represented by the function U(a, b), where a and b represent the starting and ending values, respectively. Similar to a discrete uniform distribution, there is a continuous uniform distribution for continuous variables.
The drawback of this distribution is that it often provides us with no relevant information. Using our example of rolling a die, we get an expected value of 3.5, which gives us no useful intuition since no face of a die shows 3.5. Since all values are equally likely, it gives us no real predictive power.
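The die example can be worked through in a few lines. This is only an illustrative sketch using Python’s standard fractions module, with variable names chosen here for readability; it builds the probability of each face and computes the expected value and variance exactly.

```python
from fractions import Fraction

# A fair six-sided die: every face has the same probability.
faces = [1, 2, 3, 4, 5, 6]
pmf = {face: Fraction(1, len(faces)) for face in faces}

# Expected value: sum of each outcome weighted by its probability.
expected_value = sum(face * p for face, p in pmf.items())

# Variance: probability-weighted squared distance from the expected value.
variance = sum((face - expected_value) ** 2 * p for face, p in pmf.items())

print(pmf[3])          # 1/6
print(expected_value)  # 7/2
print(variance)        # 35/12
```

The expected value of 7/2 (3.5) is not a face of the die, which is exactly why it says so little about any single roll.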
Bernoulli Distribution: Single-trial with Two Possible Outcomes
The Bernoulli distribution is one of the easiest distributions to understand. It can be used as a starting point to derive more complex distributions. Any event with a single trial and only two outcomes follows a Bernoulli distribution. Flipping a coin or choosing between True and False in a quiz are examples of a Bernoulli distribution.
Both have a single trial and only two outcomes. Let’s assume you flip a coin once; this is a single trial. The only two outcomes are heads and tails. This is an example of a Bernoulli distribution.
Usually, when following a Bernoulli distribution, we have the probability of one of the outcomes (p). From (p), we can deduce the probability of the other outcome by subtracting it from the total probability (1), represented as (1-p).
It is represented by bern(p), where p is the probability of success. The expected value of a Bernoulli trial ‘x’ is represented as, E(x) = p, and similarly, Bernoulli variance is, Var(x) = p(1-p).
The graph of a Bernoulli distribution is simple to read. It consists of only two bars, one rising to the associated probability p and the other growing to 1-p.
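As a quick sanity check on E(x) = p and Var(x) = p(1-p), the sketch below simulates many Bernoulli trials with Python’s standard random module and compares the sample mean with the theoretical value. The probability p = 0.3, the seed, and the function names are arbitrary choices for illustration.

```python
import random


def bernoulli_mean_var(p):
    """Theoretical mean and variance of a Bernoulli(p) variable."""
    return p, p * (1 - p)


def simulate_bernoulli(p, n_trials, seed=0):
    """Draw n_trials outcomes: 1 (success) with probability p, else 0."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n_trials)]


p = 0.3  # assumed probability of success, for illustration only
mean, var = bernoulli_mean_var(p)
samples = simulate_bernoulli(p, 100_000)
sample_mean = sum(samples) / len(samples)

print(mean, var)
print(sample_mean)  # should land close to p
```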
Binomial Distribution: A Sequence of Bernoulli Events
The Binomial Distribution can be thought of as the sum of outcomes of an event following a Bernoulli distribution. Therefore, Binomial Distribution is used in binary outcome events, and the probability of success and failure is the same in all successive trials. An example of a binomial event would be flipping a coin multiple times to count the number of heads and tails.
The difference between these distributions can be explained through an example. Consider you’re attempting a quiz that contains 10 True/False questions. Trying a single T/F question would be considered a Bernoulli trial, whereas attempting the entire quiz of 10 T/F questions would be categorized as a Binomial trial. The main characteristics of Binomial Distribution are:
Given multiple trials, each of them is independent of the other. That is, the outcome of one trial doesn’t affect another one.
Each trial can lead to just two possible results (e.g., winning or losing), with probabilities p and (1 – p).
A binomial distribution is represented by B(n, p), where n is the number of trials and p is the probability of success in a single trial. A Bernoulli distribution can be written as the binomial distribution B(1, p) since it has only one trial. The expected value of a binomial variable “x” is the expected number of successes, represented as E(x) = np. Similarly, variance is represented as Var(x) = np(1-p).
Given the probability of success (p) and the number of trials (n), we can calculate the probability of getting exactly x successes in these n trials using the binomial formula: P(X = x) = [n! / (x! (n - x)!)] p^x (1 - p)^(n - x).
For example, suppose that a candy company produces both milk chocolate and dark chocolate candy bars. The total products contain half milk chocolate bars and half dark chocolate bars. Say you choose ten candy bars at random and choosing milk chocolate is defined as a success. The probability distribution of the number of successes during these ten trials with p = 0.5 is shown here in the binomial distribution graph:
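The candy bar example can be reproduced numerically. The following sketch assumes the standard binomial probability mass function, P(X = x) = C(n, x) p^x (1 - p)^(n - x), and uses math.comb from the standard library; the function name binomial_pmf is just an illustrative choice.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 10, 0.5  # ten candy bars, milk chocolate counts as a success
distribution = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Expected number of milk chocolate bars, computed from the distribution.
expected_successes = sum(x * prob for x, prob in enumerate(distribution))

print(distribution[5])     # 0.24609375
print(expected_successes)  # matches n * p = 5
```

The probabilities peak at five milk chocolate bars and fall off symmetrically on both sides, which is the shape a binomial graph with p = 0.5 shows.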
Poisson Distribution: The Probability that an Event May or May not Occur
Poisson distribution deals with the frequency with which an event occurs within a specific interval. Instead of the probability of an event, Poisson distribution requires knowing how often it happens in a particular period or distance. For example, a cricket chirps two times in 7 seconds on average. We can use the Poisson distribution to determine the likelihood of it chirping five times in 15 seconds.
A Poisson distribution is represented with the notation Po(λ), where λ represents the expected number of events that take place in a given period. The expected value and the variance of a Poisson distribution are both λ. With X as the discrete random variable counting the events, a Poisson distribution can be modeled using the following formula: P(X = x) = λ^x e^(-λ) / x!, for x = 0, 1, 2, and so on.
The main characteristics which describe the Poisson Processes are:
The events are independent of each other.
An event can occur any number of times (within the defined period).
Two events can’t take place simultaneously.
The graph of Poisson distribution plots the number of instances an event occurs in the standard interval of time and the probability of each one.
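The cricket example can be sketched in code. This assumes the standard Poisson probability mass function, P(X = x) = λ^x e^(-λ) / x!, and rescales the average rate of two chirps per 7 seconds to a 15-second window; the function and variable names are illustrative.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """Probability of exactly x events when lam events are expected."""
    return lam**x * exp(-lam) / factorial(x)


rate_per_second = 2 / 7     # two chirps every 7 seconds on average
lam = rate_per_second * 15  # expected chirps in a 15-second window

p_five_chirps = poisson_pmf(5, lam)
print(lam)
print(p_five_chirps)
```

With about 4.3 chirps expected in 15 seconds, the chance of hearing exactly five comes out at roughly 17%.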
Continuous Distributions
Normal Distribution: Symmetric Distribution of Values Around the Mean
Normal distribution is the most used distribution in data science. In a normal distribution graph, data is symmetrically distributed with no skew. When plotted, the data follows a bell shape, with most values clustering around a central region and tapering off as they go further away from the center.
The normal distribution frequently appears in nature and life in various forms. For example, the scores of a quiz often follow an approximately normal distribution. Many of the students scored between 60 and 80 as illustrated in the graph below. Of course, students with scores that fall outside this range are deviating from the center.
Here, you can witness the “bell-shaped” curve around the central region, indicating that most data points exist there. The normal distribution is represented as N(µ, σ²), where µ represents the mean and σ² represents the variance; together they fully specify the distribution. The expected value of a normal distribution is equal to its mean. Some of the characteristics which can help us to recognize a normal distribution are:
The curve is symmetric at the center. Therefore mean, mode, and median are equal to the same value, distributing all the values symmetrically around the mean.
The area under the distribution curve equals 1 (all the probabilities must sum up to 1).
While plotting a graph for a normal distribution, 68% of all values lie within one standard deviation from the mean. In the example above, if the mean is 70 and the standard deviation is 10, 68% of the values will lie between 60 and 80. Similarly, 95% of the values lie within two standard deviations from the mean, and 99.7% lie within three standard deviations from the mean. This last interval captures almost all values. If a data point falls outside it, it is most likely an outlier.
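The 68%, 95%, and 99.7% figures can be checked with Python’s standard statistics.NormalDist class. The sketch below reuses the quiz example (mean 70, standard deviation 10); the helper name share_within is an illustrative choice.

```python
from statistics import NormalDist

scores = NormalDist(mu=70, sigma=10)  # quiz scores from the example


def share_within(dist, k):
    """Share of values within k standard deviations of the mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower


for k in (1, 2, 3):
    print(k, format(share_within(scores, k), ".4f"))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```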
Student’s t-Distribution: Small Sample Size Approximation of a Normal Distribution
The student’s t-distribution, also known as the t distribution, is a type of statistical distribution similar to the normal distribution with its bell shape but has heavier tails. The t distribution is used instead of the normal distribution when you have small sample sizes.
For example, suppose we deal with the total number of apples sold by a shopkeeper in a month. In that case, we will use the normal distribution. Whereas, if we are dealing with the total amount of apples sold in a day, i.e., a smaller sample, we can use the t distribution.
Another critical difference between the Student’s t distribution and the normal one is that, apart from the mean and variance, we must also define the degrees of freedom for the distribution. In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. A Student’s t distribution is represented as t(k), where k represents the number of degrees of freedom. For k > 1, the expected value exists and equals the mean; for k > 2, the variance is also finite, and for the standard t distribution it equals k/(k - 2).
Degrees of freedom are in the left column of the t-distribution table.
Overall, the student t distribution is frequently used when conducting statistical analysis and plays a significant role in performing hypothesis testing with limited data.
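To see the heavier tails numerically, the sketch below evaluates the density of the standard t distribution three units away from the center and compares it with the standard normal density. It uses the textbook density formula with math.gamma; the function name t_pdf and the chosen degrees of freedom are illustrative assumptions.

```python
from math import gamma, pi, sqrt
from statistics import NormalDist


def t_pdf(x, k):
    """Density of the standard Student's t distribution with k degrees of freedom."""
    coefficient = gamma((k + 1) / 2) / (sqrt(k * pi) * gamma(k / 2))
    return coefficient * (1 + x * x / k) ** (-(k + 1) / 2)


standard_normal = NormalDist(mu=0, sigma=1)

# Density three units from the center: larger values mean a heavier tail.
for k in (2, 5, 30):
    print(k, t_pdf(3, k))
print("normal", standard_normal.pdf(3))
```

The t densities at this point are all larger than the normal one, and they shrink towards it as the degrees of freedom grow, which is why the two distributions become hard to tell apart for large samples.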
Exponential Distribution: Model Elapsed Time between Two Events
Exponential distribution is one of the widely used continuous distributions. It is used to model the time taken between different events.
For example, in physics, it is often used to model radioactive decay; in engineering, to model the time until a defective part is received on an assembly line; and in finance, to model the time until the next default for a portfolio of financial assets. Another common application of exponential distributions is survival analysis (e.g., expected life of a device/machine).
The exponential distribution is commonly represented as Exp(λ), where λ is the distribution parameter, often called the rate parameter. We can find the value of λ with the formula λ = 1/μ, where μ is the mean. Here, the standard deviation is the same as the mean, so the variance is Var(x) = 1/λ².
An exponential graph is a curved line representing how the probability changes exponentially. Exponential distributions are commonly used in calculations of product reliability or the length of time a product lasts.
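As a small reliability-style illustration, the sketch below derives the rate parameter from an assumed mean lifetime and computes the probability that a device fails within a given time using the exponential cumulative distribution function, 1 - e^(-λt). The 10-year mean lifetime and the function name are hypothetical.

```python
from math import exp


def exponential_cdf(t, rate):
    """Probability that the waiting time is at most t."""
    return 1 - exp(-rate * t) if t >= 0 else 0.0


mean_lifetime = 10.0      # hypothetical mean lifetime of a device, in years
rate = 1 / mean_lifetime  # rate parameter: lambda = 1 / mean
std_dev = 1 / rate        # the standard deviation equals the mean
variance = 1 / rate**2

p_fail_within_5 = exponential_cdf(5, rate)
p_survive_past_10 = 1 - exponential_cdf(10, rate)

print(rate, std_dev, variance)
print(p_fail_within_5)
print(p_survive_past_10)
```

Note that fewer than half of the devices survive past the mean lifetime here, a reminder that the exponential distribution is strongly skewed rather than symmetric.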
Conclusion
Data is an essential component of the data exploration and model development process. The first thing that springs to mind when working with continuous variables is looking at the data distribution. We can adjust our machine-learning models to best match the problem if we can identify the pattern in the data distribution, which reduces the time to get to an accurate outcome.
Indeed, specific machine learning models are built to perform best when certain distribution assumptions are met. Knowing which distributions we’re dealing with may thus assist us in determining which models to apply.