In today’s digital age, with a plethora of tools available at our fingertips, researchers can now collect and analyze data with greater ease and efficiency. These research tools not only save time but also provide more accurate and reliable results. In this blog post, we will explore some of the essential research tools that every researcher should have in their toolkit.
From data collection to data analysis and presentation, this blog will cover it all. So, if you’re a researcher looking to streamline your work and improve your results, keep reading to discover the must-have tools for research success.
Revolutionize your research: The top 20 must-have research tools
Research requires various tools to collect, analyze and disseminate information effectively. Some essential research tools include search engines like Google Scholar, JSTOR, and PubMed, reference management software like Zotero, Mendeley, and EndNote, statistical analysis tools like SPSS, R, and Stata, writing tools like Microsoft Word and Grammarly, and data visualization tools like Tableau and Excel.
1. Google Scholar – Google Scholar is a search engine for scholarly literature, including articles, theses, books, and conference papers.
2. JSTOR – JSTOR is a digital library of academic journals, books, and primary sources.
3. PubMed – PubMed is a free search engine accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics.
4. Web of Science – Web of Science is a citation index that allows you to search for articles, conference proceedings, and books across various scientific disciplines.
5. Scopus – Scopus is a citation database that covers scientific, technical, medical, and social sciences literature.
6. Zotero – Zotero is a free, open-source citation management tool that helps you organize your research sources, create bibliographies, and collaborate with others.
7. Mendeley – Mendeley is a reference management software that allows you to organize and share your research papers and collaborate with others.
8. EndNote – EndNote is a software tool for managing bibliographies, citations, and references on the Windows and macOS operating systems.
9. RefWorks – RefWorks is a web-based reference management tool that allows you to create and organize a personal database of references and generate citations and bibliographies.
10. Evernote – Evernote is a digital notebook that allows you to capture and organize your research notes, web clippings, and documents.
11. SPSS – SPSS is a statistical software package used for data analysis, data mining, and forecasting.
12. R – R is a free, open-source software environment for statistical computing and graphics.
13. Stata – Stata is a statistical software package that provides a suite of applications for data management and statistical analysis.
14. Excel – Excel is spreadsheet software used for organizing, analyzing, and presenting data.
15. Tableau – Tableau is a data visualization software that allows you to create interactive visualizations and dashboards.
16. NVivo – NVivo is a software tool for qualitative research and data analysis.
17. Slack – Slack is a messaging platform for team communication and collaboration.
18. Zoom – Zoom is a video conferencing software that allows you to conduct virtual meetings and webinars.
19. Microsoft Teams – Microsoft Teams is a collaboration platform that allows you to chat, share files, and collaborate with your team.
20. Qualtrics – Qualtrics is an online survey platform that allows researchers to design and distribute surveys, collect and analyze data, and generate reports.
Other helpful tools for collaboration and organization include NVivo, Slack, Zoom, and Microsoft Teams. With these tools, researchers can effectively find relevant literature, manage references, analyze data, write research papers, create visual representations of data, and collaborate with peers.
Maximizing accuracy and efficiency with research tools
Research is a vital aspect of any academic discipline, and it is critical to have access to appropriate research tools to facilitate the research process. Researchers require access to various research tools and software to conduct research, analyze data, and report research findings. Some standard research tools researchers use include search engines, reference management software, statistical analysis tools, writing tools, and data visualization tools.
Specialized research tools are also available for researchers in specific fields, such as GIS software for geographers and gene sequence analysis tools for geneticists. These tools help researchers organize data, collaborate with peers, and effectively present research findings.
It is crucial for researchers to choose the right tools for their research project, as these tools can significantly impact the accuracy and reliability of research findings.
Conclusion
Summing it up, researchers today have access to an array of essential research tools that can help simplify the research process. From data collection to analysis and presentation, these tools make research more accessible, efficient, and accurate. By leveraging them, researchers can improve their workflow and produce higher-quality research.
Get ahead in data analysis with our summary of the top 7 must-know statistical techniques. Master these tools for better insights and results.
While the field of statistical inference is fascinating, many people have a tough time grasping its subtleties. For example, some may not be aware that there are multiple types of inference and that each is applied in a different situation. Moreover, the applications to which inference can be applied are equally diverse.
For example, when it comes to assessing the credibility of a witness, we need to know how reliable the person is and how likely it is that the person is lying. Similarly, when it comes to making predictions about the future, it is important to factor in not just the accuracy of the forecast but also whether it is credible.
Top statistical techniques – Data Science Dojo
Counterfactual causal inference:
Counterfactual causal inference is a statistical technique that is used to evaluate the causal significance of historical events. Exploring how historical events may have unfolded under small changes in circumstances allows us to assess the importance of factors that may have caused the event. This technique can be used in a wide range of fields such as economics, history, and social sciences. There are multiple ways of doing counterfactual inference, such as Bayesian Structural Modelling.
Overparametrized models and regularization:
Overparametrized models are models that have more parameters than observations. Such models are prone to overfitting and may not generalize well to new data. Regularization is a technique used to combat overfitting in overparametrized models: it adds a penalty term to the loss function to discourage the model from fitting the noise in the data. Two common types of regularization are L1 and L2 regularization.
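As a quick illustration, the sketch below uses scikit-learn on a small synthetic dataset (the data and the alpha values are made up for this example) to compare an ordinary least-squares fit with L2 (ridge) and L1 (lasso) regularized fits:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic setup: 30 observations, 20 features,
# but only the first 3 features actually drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 20))
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + X[:, 2] * 1.5 + rng.normal(scale=0.5, size=30)

models = {
    "ols": LinearRegression(),
    "ridge (L2)": Ridge(alpha=1.0),   # penalty on squared coefficients
    "lasso (L1)": Lasso(alpha=0.1),   # penalty on absolute coefficients
}

for name, model in models.items():
    model.fit(X, y)
    # Lasso tends to drive irrelevant coefficients exactly to zero;
    # ridge shrinks them toward zero without eliminating them.
    n_nonzero = np.sum(np.abs(model.coef_) > 1e-6)
    print(f"{name:12s} non-zero coefficients: {n_nonzero}")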
Generic computation algorithms:
Generic computation algorithms are a set of algorithms that can be applied to a wide range of problems. They are often used to solve optimization problems, with gradient descent and conjugate gradient being common examples. They also underpin machine learning methods such as support vector machines and k-means clustering.
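For instance, here is a minimal gradient-descent sketch in plain NumPy that minimizes a simple quadratic function; the learning rate and iteration count are arbitrary values chosen for illustration.

import numpy as np

def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100):
    """Generic gradient descent: repeatedly step against the gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - learning_rate * grad(x)
    return x

# Minimize f(x, y) = (x - 3)^2 + (y + 1)^2, whose gradient is (2(x - 3), 2(y + 1)).
grad_f = lambda v: np.array([2 * (v[0] - 3), 2 * (v[1] + 1)])
print(gradient_descent(grad_f, x0=[0.0, 0.0]))  # approaches [3, -1]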
Robust inference:
Robust inference is a technique that is used to make inferences that are not sensitive to outliers or extreme observations. This technique is often used in cases where the data is contaminated with errors or outliers. There are several robust statistical methods such as the median and the Huber M-estimator.
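As a toy illustration with made-up numbers, a single extreme observation drags the mean while the median and a trimmed mean barely move:

import numpy as np
from scipy import stats

clean = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3])
contaminated = np.append(clean, 100.0)   # one gross outlier

print(np.mean(clean), np.mean(contaminated))        # mean jumps dramatically
print(np.median(clean), np.median(contaminated))    # median is nearly unchanged
# A 10% trimmed mean discards the most extreme values before averaging.
print(stats.trim_mean(contaminated, proportiontocut=0.1))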
Bootstrapping and simulation-based inference:
Bootstrapping and simulation-based inference are techniques that are used to estimate the precision of sample statistics and to evaluate and compare models. Bootstrapping is a resampling technique that is used to estimate the sampling distribution of a statistic by resampling the data with replacement.
Simulation-based inference is a method that is used to estimate the sampling distribution of a statistic by generating many simulated samples from the model.
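A minimal bootstrap sketch, assuming we want a 95% confidence interval for the mean of a small made-up sample:

import numpy as np

rng = np.random.default_rng(42)
sample = np.array([4.3, 5.1, 4.8, 6.0, 5.5, 4.9, 5.2, 5.8, 4.7, 5.4])

# Resample with replacement many times and recompute the statistic each time.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])

# The spread of the bootstrap distribution estimates the sampling variability.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap 95% CI for the mean: ({low:.2f}, {high:.2f})")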
Multilevel models:
Multilevel models are a class of models that are used to account for the hierarchical structure of data. These models are often used in fields such as education, sociology, and epidemiology. They are also known as hierarchical linear models, mixed-effects models, or random coefficient models.
Adaptive decision analysis:
Adaptive Decision Analysis is a statistical technique that is used to make decisions under uncertainty. It involves modeling the decision problem, simulating the outcomes of the decision and updating the decision based on the new information. This method is often used in fields such as finance, engineering, and healthcare.
Which statistical techniques do you use most?
This article discusses most of the statistical methods that are used in quantitative fields. These are often used to infer causal relationships between variables.
A primary goal of many statistical analyses is to infer causality from observational data. This is usually difficult to achieve for two reasons. First, observational data may be noisy and contaminated by errors. Second, variables are often correlated. To correctly infer causality, it is necessary to model these correlations and to account for any biases and confounding factors.
As statistical techniques are often implemented using specific software packages, the implementations of each method often differ. This article first briefly describes the papers and software packages that are used in the following sections. It then describes the most common statistical techniques and the best practices that are associated with each technique.
In this blog, we are going to learn the differences and similarities between linear regression and logistic regression.
Regression is a statistical technique used in finance, investing, and other disciplines to establish the nature and strength of the relationship between a single dependent variable (often represented by Y) and one or more independent variables.
Linear regression vs logistic regression – Data Science Dojo
Regression analysis is central to forecasting and prediction, and it overlaps heavily with machine learning. This statistical approach is employed in a variety of industries, including:
Financial: Understanding stock price trends, making price predictions, and assessing insurance risk.
Marketing: Analyze the success of marketing initiatives and project product pricing and sales.
Manufacturing: Assess the relationships between the variables that define a better engine and its performance.
Medicine: Forecast which combinations of medications to use when producing generic medications for ailments.
The most popular variation of this method is linear regression, which is also known as simple regression or ordinary least squares (OLS). Based on a line of best fit, linear regression determines the linear relationship between two variables.
The slope of the straight line used to represent linear regression indicates how changing one variable affects the other. The y-intercept represents the value of the dependent variable when the independent variable is zero. There are also non-linear regression models, although they are far more complicated.
Terminologies used in regression analysis
Outliers
The term "outlier" refers to an observation in a dataset that has an extremely high or extremely low value compared with the other observations, i.e., it does not appear to belong to the same population.
Multicollinearity
The independent variables are said to be multicollinear when there is a strong correlation between them.
Heteroscedasticity
Heteroscedasticity refers to a situation in which the variability of the residuals is not constant across values of the independent variable.
Both under- and over-fit
Overfitting may result from the use of extraneous explanatory variables. It occurs when our algorithm performs admirably on the training set but falls short on the test set. Underfitting is the opposite problem: the model is too simple to capture the underlying pattern, so it performs poorly even on the training data.
Linear regression
In simple terms, linear regression is used to find a relationship between two variables, a dependent variable (y) and an independent variable (X), with the help of a straight line. It makes predictions for continuous or numeric variables such as sales, salary, age, and product price, and shows us how the value of the dependent variable changes with a change in the value of the independent variable.
Let’s say we have a dataset available consisting of house areas in square meters and their respective prices.
As a change in area results in a change in the price of a house, we will put the area on the X-axis as the independent variable and the price on the Y-axis as the dependent variable.
On the chart, these data points would appear as a scatter plot, a set of points that may or may not appear to be organized along any line.
Now using this data, we are required to predict the price of houses having the following areas:
500, 2000, and 3500.
After plotting these points, if a linear pattern is visible, sketch a straight line as the line of best fit.
The best fit line we draw minimizes the distance between it and the observed data. Estimating this line is a key component of regression analysis that helps to infer the relationships between a dependent variable and an independent variable.
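As a sketch of that idea, with entirely made-up areas and prices, a one-variable least-squares fit can be computed with NumPy and then used to predict prices for the three areas mentioned above:

import numpy as np

# Hypothetical training data: area in square meters vs. price (in thousands).
area = np.array([600, 900, 1200, 1500, 1800, 2400, 3000])
price = np.array([55, 80, 108, 130, 160, 205, 255])

# Fit the line of best fit (degree-1 polynomial = simple linear regression).
slope, intercept = np.polyfit(area, price, deg=1)

# Predict prices for the new areas from the example.
new_areas = np.array([500, 2000, 3500])
predictions = slope * new_areas + intercept
print(dict(zip(new_areas.tolist(), np.round(predictions, 1))))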
Measures for linear regression
To understand the amount of error that exists between different models in linear regression, we use metrics. Let’s discuss some of the evaluation measures for regression:
Mean Absolute Error
Mean absolute error measures the absolute difference between the predicted and actual values of the model. It is the average of these absolute prediction errors, and lower MAE values indicate a better fit.
Root Mean Squared Error
Root Mean Squared Error indicates how far the residuals are from zero. Residuals represent the difference between the observed and predicted values of the dependent variable. RMSE squares the residuals, averages them, and takes the square root, so it measures the typical difference between the actual targets and the predicted values. Lower RMSE values indicate shorter distances from the actual data points to the fitted line, and therefore a better fit. RMSE is expressed in the same units as the dependent variable.
R-Squared Measure
The R-squared measure is the proportion of the variance in the dependent variable that is explained by the model. It ranges from 0 to 1, and values closer to 1 indicate that the regression line accounts for more of the variation in the observed data.
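As a short sketch, the snippet below computes these three metrics by hand with NumPy on made-up actual and predicted values; scikit-learn's metrics module offers equivalent helpers.

import numpy as np

actual = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
predicted = np.array([2.6, 5.4, 7.0, 9.8, 10.5])

residuals = actual - predicted
mae = np.mean(np.abs(residuals))                  # Mean Absolute Error
rmse = np.sqrt(np.mean(residuals ** 2))           # Root Mean Squared Error
ss_res = np.sum(residuals ** 2)                   # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)    # total sum of squares
r_squared = 1 - ss_res / ss_tot                   # proportion of variance explained

print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}, R^2 = {r_squared:.3f}")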
Logistic regression
Logistic regression is one of the most commonly employed machine learning techniques for binary classification problems, that is, problems with two class values. These problems include predictions like "this or that," "yes or no," and "A or B." Additionally, logistic models can transform raw data streams into features for various AI and machine learning methods.
Logistic regression can also be used to estimate the probability of an event by establishing a link between the features and the likelihood of a particular outcome. In other words, it can be applied to classification by building a model that links the number of hours of study to the likelihood that a student will pass or fail.
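A minimal sketch of that study-hours example using scikit-learn, with made-up hours and pass/fail labels (the numbers are illustrative only):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied vs. pass (1) / fail (0).
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5], [5.0]])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(hours, passed)

# Predicted probability of passing after 2 and 4 hours of study.
print(clf.predict_proba([[2.0], [4.0]])[:, 1])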
Comparison of linear regression and logistic regression
The primary distinction between logistic and linear regression is that the output of logistic regression is categorical, whereas the output of linear regression is continuous.
The outcome, or dependent variable, in logistic regression has just two possible values. However, the output of a linear regression is continuous, which means that there are an endless number of possible values for it.
When the response variable is categorical, such as yes/no, true/false, and pass/fail, logistic regression is utilised. When the response variable is continuous, like hours, height, or weight, linear regression is utilised.
Given information about the amount of time a student spent studying and their exam results, for instance, linear regression could predict the student's exam score, while logistic regression could predict whether the student passed.
Curves: a visual representation of linear and logistic regression
Regression curves – Visual representation of linear regression and logistic regression
A straight line, often known as a regression line, is used to indicate linear regression. This line displays the expected score on “y” for each value of “x.” Additionally, the distance between the data points on the plot and the regression line reveals model flaws.
In contrast, an S-shaped curve is revealed using logistic regression. Here, the orientation and steepness of the curve are affected by changes in the regression coefficients. So, it follows that a positive slope yields an S-shaped curve, but a negative slope yields a Z-shaped curve.
Which one to use – Linear regression or logistic regression?
Regression analysis requires careful attention to the problem statement, which must be understood before proceeding. It makes sense to apply linear regression if the problem statement mentions forecasting a continuous value. If the problem statement involves binary classification, logistic regression should be used. Similarly, we must evaluate each of our regression models in light of the problem statement.
Enroll in Data Science Bootcamp to learn more about these ideas and advance your career today.
Statistical distributions help us understand a problem better by assigning a range of possible values to the variables, making them very useful in data science and machine learning. Here are 6 types of distributions with intuitive examples that often occur in real-life data.
In statistics, a distribution is simply a way to understand how a set of data points are spread over some given range of values.
For example, the distribution of exam scores in a class tells us how many students fall within each range of marks, and how likely a randomly chosen student is to land in a particular range.
Types of probability distribution – Data Science Dojo
Types of statistical distributions
There are several statistical distributions, each representing different types of data and serving different purposes. Here we will cover several commonly used distributions.
1. Normal distribution
A normal distribution, also known as a "Gaussian distribution," shows the probability density for a population of continuous data (for example, height in cm for all NBA players). It indicates the likelihood that any NBA player will have a particular height: fewer players are much taller or much shorter than usual, and most are close to the average height.
The spread of the values in our population is measured using a metric called standard deviation. The Empirical Rule tells us that:
68.3% of the values will fall between 1 standard deviation above and below the mean
95.5% of the values will fall between 2 standard deviations above and below the mean
99.7% of the values will fall between 3 standard deviations above and below the mean
Let's assume that we know that the mean height of all players in the NBA is 200cm and the standard deviation is 7cm. If LeBron James is 206cm tall, what proportion of NBA players is he taller than? We can figure this out! LeBron is 6cm taller than the mean (206cm – 200cm). Since the standard deviation is 7cm, he is 0.86 standard deviations (6cm / 7cm) above the mean.
Our value of 0.86 standard deviations is called the z-score. Converting the z-score to a percentile using the cumulative distribution function of the normal distribution (or a look-up table) gives us our answer: LeBron is taller than roughly 80.5% of players in the NBA! (The related probability density function, or PDF, describes how likely the random variable is to fall within any particular range of values.)
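That conversion is a one-liner with SciPy's normal distribution, which performs the same calculation as a z-table:

from scipy.stats import norm

z = (206 - 200) / 7          # LeBron's height as a z-score, about 0.86
print(norm.cdf(z))           # ≈ 0.80, matching the roughly 80.5% figure above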
2. t-distribution
A t-distribution is symmetrical around the mean, like a normal distribution, and its breadth is determined by the variance of the data. A t-distribution is made for circumstances where the sample size is limited, but a normal distribution works with a population. With a smaller sample size, the t-distribution takes on a broader range to account for the increased level of uncertainty.
The curve of a t-distribution is determined by its number of degrees of freedom, which is the sample size minus one. The t-distribution tends to resemble a normal distribution as the sample size and degrees of freedom increase, because a bigger sample size increases our confidence in estimating the underlying population statistics.
For example, suppose we deal with the total number of apples sold by a shopkeeper in a month. In that case, we would use the normal distribution. Whereas, if we are dealing with the total number of apples sold in a day, i.e., a smaller sample, we can use the t-distribution.
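To see the effect numerically, we can compare the two-sided 95% critical values of a t-distribution for a small sample with those of the normal distribution using SciPy:

from scipy import stats

# Two-sided 95% critical values: the t value is noticeably larger for small samples.
print(stats.t.ppf(0.975, df=9))    # ≈ 2.26 with a sample of 10 (9 degrees of freedom)
print(stats.norm.ppf(0.975))       # ≈ 1.96 for the normal distribution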
3. Binomial distribution
A Binomial Distribution can look a lot like a normal distribution's shape. The main difference is that instead of plotting continuous data, it plots a distribution of two possible discrete outcomes, for example, the results from flipping a coin. Imagine flipping a coin 10 times, and from those 10 flips, noting down how many were "Heads". It could be any number between 0 and 10. Now imagine repeating that task 1,000 times.
If the coin we are using is indeed fair (not biased to heads or tails), then the distribution of outcomes should start to look like the plot above. In the vast majority of cases, we get 4, 5, or 6 "heads" from each set of 10 flips, and the likelihood of getting more extreme results is much rarer!
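A quick simulation of exactly that experiment with NumPy (the seed is arbitrary):

import numpy as np

rng = np.random.default_rng(7)
# 1,000 repetitions of "flip a fair coin 10 times and count the heads".
heads_per_experiment = rng.binomial(n=10, p=0.5, size=1000)

counts = np.bincount(heads_per_experiment, minlength=11)
for k, c in enumerate(counts):
    print(f"{k:2d} heads: {c}")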
4. Bernoulli distribution
The Bernoulli Distribution is a special case of the Binomial Distribution. It considers only two possible outcomes: success and failure, or true and false. It's a really simple distribution, but worth knowing! In the example below we're looking at the probability of rolling a 6 with a standard die.
If we roll a die many, many times, the probability of rolling a 6 should settle at 1 out of every 6 rolls (or 16.7%), and the probability of not rolling a 6, in other words rolling a 1, 2, 3, 4, or 5, at 5 out of every 6 rolls (or 83.3%).
5. Discrete uniform distribution: All outcomes are equally likely
Uniform distribution is represented by the function U(a, b), where a and b represent the starting and ending values, respectively. Like a discrete uniform distribution, there is a continuous uniform distribution for continuous variables.
In statistics, uniform distribution refers to a statistical distribution in which all outcomes are equally likely. Consider rolling a six-sided die. You have an equal probability of obtaining all six numbers on your next roll, i.e., obtaining precisely one of 1, 2, 3, 4, 5, or 6, equaling a probability of 1/6, hence an example of a discrete uniform distribution.
As a result, the uniform distribution graph contains bars of equal height representing each outcome. In our example, the height is a probability of 1/6 (0.166667).
The drawback of this distribution is that it often provides us with no relevant information. Using our example of rolling a die, we get an expected value of 3.5, which gives us no useful intuition, since there is no such thing as half a number on a die. Since all values are equally likely, it gives us no real predictive power.
It is a distribution in which all events are equally likely to occur. Below, we're looking at the results from rolling a die many, many times, noting which number we got on each roll and tallying these up. If we roll the die enough times (and the die is fair), we should end up with a completely uniform probability where the chance of getting any outcome is exactly the same.
6. Poisson distribution
A Poisson Distribution is a discrete distribution similar to the Binomial Distribution, in that we're plotting the probability of whole-numbered outcomes. Unlike the other distributions we have seen, however, this one is not symmetrical: it is skewed, taking values from 0 upward with no upper bound.
For example, a cricket chirps two times in 7 seconds on average. We can use the Poisson distribution to determine the likelihood of it chirping five times in 15 seconds. A Poisson process is represented with the notation Po(λ), where λ represents the expected number of events that can take place in a period.
The expected value and variance of a Poisson process are both λ. X represents the discrete random variable. A Poisson Distribution can be modeled using the following formula.
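In the usual notation, the probability of observing exactly k events when the expected count is λ is:

P(X = k) = (e^(-λ) · λ^k) / k!,   for k = 0, 1, 2, …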
The Poisson distribution describes the number of events or outcomes that occur during some fixed interval. Most commonly this is a time interval like in our example below where we are plotting the distribution of sales per hour in a shop.
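Applying this to the cricket example above with SciPy, scaling the rate of 2 chirps per 7 seconds to a 15-second window:

from scipy.stats import poisson

lam = 2 * (15 / 7)                 # expected chirps in 15 seconds ≈ 4.29
print(poisson.pmf(5, mu=lam))      # probability of exactly 5 chirps, roughly 0.17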
Conclusion:
Understanding how data is distributed is an essential component of the data exploration and model development process. If we can identify the pattern in the data distribution, we can adjust our machine learning models to best match the problem, which reduces the time needed to reach an accurate outcome.
Indeed, some machine learning models are built to perform best when certain distribution assumptions are met. Knowing which distributions we're dealing with may thus help us determine which models to apply.
The Monte Carlo method is a technique for solving complex problems using probability and random numbers. Through repeated random sampling, Monte Carlo calculates the probabilities of multiple possible outcomes occurring in an uncertain process.
Whenever you try to solve a problem involving the future, you make certain assumptions. For example, forecasting problems make assumptions about things like the cost of a particular item, the value of stocks, or the electricity units that will be used in the future. Since these problems try to predict an estimate of an unknown value based on historical data, there always exists inherent risk and uncertainty.
The Monte Carlo simulation allows us to see all the possible outcomes of our decisions and assess risk, consequently allowing for better decision-making under uncertainty.
This blog will walk through the famous Monty Hall problem and how it can be solved with the Monte Carlo method in Python.
Monty Hall problem
In the Monty Hall problem, the TV show host Monty presents three doors to the participant. Behind one of the doors is a valuable prize like a car, while behind the others is a less valuable prize like a goat.
Consider yourself to be one of the participants in the show. You choose one of the three doors. Before opening your chosen door, Monty opens another door, revealing one of the goats. Now you are left with two doors: behind one could be the car, and behind the other is the other goat.
Monty then gives you the option to either switch your answer to the other unopened door or stick to the original one.
Is it in your favor to switch your answer to the other door? Well, probability says it is!
Let’s see how:
Initially, there are three unopened doors in front of you. The probability of the car being behind any of these doors is 1/3.
Let’s say you decide to pick door #1 as the probability is the same (1/3) for each of these doors. In other words, the probability that the car is behind door #1 is 1/3, and the probability that it will be behind either door #2 or door #3 is 2/3.
Monty is aware of the prize behind each door. He chooses to open door #3 and reveal a goat. He then asks you if you would like to either switch to door #2 or stick with door #1.
To solve the problem, let’s switch to Python and apply the Monte Carlo simulation.
Solving with Python
Initialize the 3 prizes
Create Python lists to store the probabilities after each game. We will play as many games as the number of iterations given as input.
Monte Carlo simulation
Before starting the game, we randomize the prizes behind each door. One of the doors will have a car behind it, while the other two will each have a goat. When we play a large number of games, all possible permutations of prize placements and door choices get covered.
Below is the code that decides if your choice was correct or not, and if switching would’ve been the correct move.
After playing each game, the winning probabilities are updated and stored in the lists. When all games have been played, we return the final values of each of the lists, i.e., winning by switching your choice and winning by sticking to your choice.
Get results
Enter your desired number of iterations (the higher the number, the more games will be played to approximate the probabilities). In the final step, plot your results.
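Since the steps above describe the code at a high level, here is a compact, self-contained sketch of the whole simulation; the function and variable names are our own choices rather than those of the original notebook.

import random

def monty_hall(iterations=1000):
    switch_wins, stick_wins = 0, 0
    switch_probs, stick_probs = [], []          # running probabilities after each game

    for game in range(1, iterations + 1):
        doors = ["goat", "goat", "car"]
        random.shuffle(doors)                   # randomize the prize behind each door
        choice = random.randrange(3)            # contestant's initial pick

        # Monty opens a door that is neither the contestant's pick nor the car.
        monty = next(d for d in range(3) if d != choice and doors[d] != "car")
        switched = next(d for d in range(3) if d not in (choice, monty))

        stick_wins += doors[choice] == "car"
        switch_wins += doors[switched] == "car"
        stick_probs.append(stick_wins / game)
        switch_probs.append(switch_wins / game)

    return switch_probs, stick_probs

switch_probs, stick_probs = monty_hall(1000)
print(f"win by switching: {switch_probs[-1]:.3f}, win by sticking: {stick_probs[-1]:.3f}")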
After running the simulation 1000 times, the probability that we win by always switching is 67.7%, and the probability that we win by always sticking to our choice is 32.3%. In other words, you will win approximately 2/3 times if you switch your door, and only 1/3 times if you stick to the original door.
Therefore, according to the Monte Carlo simulation, we are confident that it works to our advantage to switch the door in this tricky game.
In this blog, we will introduce you to the highly rated data science statistics books on Amazon. As you read the blog, you will find 5 books for beginners and 5 books for advanced-level experts. We will discuss what’s covered in each book and how it helps you to scale up your data science career.
Advanced statistics books for data science
1. Naked Statistics: Stripping the Dread from the Data – By Charles Wheelan
The book unfolds the underlying impact of statistics on our everyday life. It walks the readers through the power of data behind the news.
Mr. Wheelan begins the book with the classic Monty Hall problem. It is a famous, seemingly paradoxical problem using Bayes’ theorem in conditional probability. Moving on, the book separates the important ideas from the arcane technical details that can get in the way. The second part of the book interprets the role of descriptive statistics in crafting a meaningful summary of the underlying phenomenon of data.
Wheelan highlights the Gini Index to show how it represents the income distribution of the nation’s residents and is mostly used to measure inequality. The later part of the book clarifies key concepts such as correlation, inference, and regression analysis explaining how data is being manipulated in order to tackle thorny questions. Wheelan’s concluding chapter is all about the amazing contribution that statistics will continue to make to solving the world’s most pressing problems, rather than a more reflective assessment of its strengths and weaknesses.
2. Bayesian Methods For Hackers – Probabilistic Programming and Bayesian Inference, By Cameron Davidson-Pilon
We mostly learn Bayesian inference through intensely complex mathematical analyses supported by artificial examples. This book instead approaches Bayesian inference through probabilistic programming with the powerful PyMC language and the closely related Python tools NumPy, SciPy, and Matplotlib.
Davidson-Pilon focuses on improving learners' understanding of the motivations, applications, and challenges in Bayesian statistics and probabilistic programming. The book brings a much-needed introduction to Bayesian methods targeted at practitioners. You will reap the most benefit from it if you already have a sound understanding of statistics; knowing about prior and posterior probabilities will give the reader an added advantage in building and training their first Bayesian model.
The second part of the book introduces the probabilistic programming library for Python through a series of detailed examples and intuitive explanations. With recent core developments and the popularity of the scientific stack in Python, PyMC is likely to become a core component soon enough. PyMC does have dependencies to run, namely NumPy and (optionally) SciPy, and to not limit the user, the examples in this book rely only on PyMC, NumPy, SciPy, and Matplotlib. The book is filled with examples, figures, and Python code that make it easy to get started solving actual problems.
3. Practical Statistics for Data Scientists – By Peter Bruce and Andrew Bruce
This book is most beneficial for readers that have some basic understanding of R programming language and statistics.
The authors cover the important concepts needed to teach practical statistics for data science, including data structures, datasets, random sampling, regression, descriptive statistics, probability, statistical experiments, and machine learning. The code is available in both Python and R, and where example code is offered with the book, you may use it in your programs and documentation.
The book defines the first step in any data science project as exploring the data, that is, exploratory data analysis. Exploratory data analysis is a comparatively new area of statistics; classical statistics focused almost exclusively on inference, a sometimes-complex set of procedures for drawing conclusions about large populations based on small samples.
To apply the statistical concepts covered in this book, unstructured raw data must be processed and manipulated into a structured form—as it might emerge from a relational database—or be collected for a study.
4. Advanced Engineering Mathematics by Erwin Kreyszig
Advanced Engineering Mathematics is a textbook for advanced engineering and applied mathematics students. The book covers ordinary and partial differential equations, linear algebra and vector calculus, Fourier analysis, complex analysis, numerical methods, optimization, and probability and statistics, with applications throughout engineering.
Advanced Engineering Mathematics focuses on the practical aspects of mathematics and is an excellent book for those who are interested in how mathematics is used in engineering and science. It can be used by students who want to study at the graduate level or by those who want to become engineers or scientists.
The text provides a self-contained introduction to these advanced mathematical concepts and methods in applied mathematics, developing each topic through worked examples oriented toward physics and engineering problems.
The book includes a large number of problems at the end of each chapter that help students develop their understanding of the material covered in the chapter.
5. Computer Age Statistical Inference by Bradley Efron and Trevor Hastie
Computer Age Statistical Inference is a book aimed at data scientists who are looking to learn about the theory behind machine learning and statistical inference. The authors have taken a unique approach in this book, as they have not only introduced many different topics, but they have also included a few examples of how these ideas can be applied in practice.
The book starts off with an introduction to statistical inference and then progresses through chapters on linear regression models, logistic regression models, statistical model selection, and variable selection. There are several appendices that provide additional information on topics such as confidence intervals and variable importance. This book is great for anyone looking for an introduction to machine learning or statistics.
Computer Age Statistical Inference is a book that introduces students to the field of statistical inference in a modern computational setting. It covers frequentist, Bayesian, and Fisherian approaches alongside computer-intensive methods such as the bootstrap, cross-validation, and large-scale hypothesis testing, which are essential for data science. It also discusses how to evaluate model performance, how to choose between parametric and nonparametric methods, how to incorporate prior information into a model, and much more.
5 Beginner level statistics books for data science
6. How to Lie with Statistics by Darrell Huff
How to Lie with Statistics is one of the most influential books about statistical reasoning. It was first published in 1954 and has been translated into many languages. The book describes how statistics can be used to mislead, through tricks such as biased samples, misleading averages and graphs, and spurious correlations, and how to recognize these tricks when weighing everyday claims. The book is intended for laymen, as it relies on illustrations and very little mathematics. It's full of interesting insights into how people can manipulate data to support their own agendas.
The book is still relevant today because it describes how people use statistics in their daily lives. It gives an understanding of the types of questions that are asked and how they are answered by statistical methods. The book also explains why some results seem more reliable than others.
The first half of the book discusses methods of making statistical claims (including how to make improper ones) and illustrates these using examples from real life. The second half shows readers how to question a statistic, explaining what to ask about who says so, how they know, what is missing, and whether the claim even makes sense.
A common criticism of the book is that it focuses too much on what statisticians do rather than why they do it. This is true — but that’s part of its appeal!
7. Head-first Statistics: A Brain-Friendly Guide Book by Dawn Griffiths
If you are looking for a book that will help you understand the basics of statistics, then this is the perfect book for you. In this book, you will learn how to use data and make informed decisions based on your findings. You will also learn how to analyze data and draw conclusions from it.
This book also suits those who have already completed a course in statistics or studied it in college and want a refresher. Griffiths gives an overview of the different types of statistical tests used in everyday life and provides examples of how to use them effectively.
The book starts off with an explanation of statistics, which includes topics such as sampling, probability, population and sample size, normal distribution and variation, confidence intervals, tests of hypotheses and correlation.
After this section, the book goes into more advanced topics such as regression analysis, hypothesis testing etc. There are also some chapters on data mining techniques like clustering and classification etc.
The author has explained each topic in detail for the readers who have little knowledge about statistics so they can follow along easily. The language used throughout this book is very clear and simple which makes it easy to understand even for beginners.
8. Think Stats By Allen B. Downey
Think Stats is a great book for students who want to learn more about statistics. The author, Allen Downey, uses simple examples and diagrams to explain the concepts behind each topic. This book is especially helpful for those who are new to mathematics or statistics because it is written in an easy-to-understand manner that even those with a high school degree can understand.
The book begins with exploratory data analysis: summarizing data, finding averages, and making simple predictions about what will happen if one quantity changes. It also covers topics like randomness, sampling techniques, sampling distributions, and probability theory, with examples worked in Python.
The author uses real-world examples throughout the book so that readers can see how these concepts apply in their own lives. He also includes exercises at the end of each chapter so that readers can practice what they’ve learned before moving on to the next section of the book. This makes Think Stats an excellent resource for anyone looking for tips on improving their math skills or just wanting to brush up on some statistical basics!
9. An Introduction To Statistical Learning With Applications In R By Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
An Introduction to Statistical Learning with Applications in R is an accessible guide to modern statistical learning methods and their applications. The book covers methods for supervised and unsupervised learning, including linear and logistic regression, resampling methods, regularization, tree-based methods and boosting, support vector machines, and clustering; later editions also include chapters on deep learning and survival analysis.
It begins with an overview of statistical learning and linear regression, followed by classification methods such as logistic regression and linear discriminant analysis. The authors then discuss resampling methods (cross-validation and the bootstrap) and regularization techniques for regression models such as ridge regression and the lasso. The remainder of the book deals with topics such as splines and generalized additive models, regression and classification trees, random forests and boosting, support vector machines (SVMs), and unsupervised methods such as principal component analysis and clustering, with hands-on labs in R throughout.
This statistics book is recommended for researchers who want to learn about statistical machine learning but do not have deep expertise in mathematics or programming.
10. Statistics in Plain English By Timothy C. Urdan
Statistics in Plain English is a guide for students of statistics. In it, Timothy Urdan covers basic concepts with examples and guidance for using statistical techniques in the real world. The book includes a glossary of terms, exercises (with solutions), and web resources.
The book begins by explaining the difference between descriptive statistics and inferential statistics, which are used to draw conclusions about data. It then covers basic vocabulary such as mean, median, mode, standard deviation, and range.
In Chapter 2, the author explains how to calculate sample sizes that are large enough to make accurate estimates. In Chapters 3–5 he gives examples of how to use various kinds of data: census data on population density; survey data on attitudes toward various products; weather reports on temperature fluctuations; and sports scores from games played by teams over time periods ranging from minutes to seasons. He also shows how to use these data to estimate the parameters for models that explain behavior in these situations.
The last three chapters cover the use of frequency distributions to answer questions about probability, such as whether there is a significant difference between two possible outcomes or whether there is a trend in a set of numbers over time or space.
Which data science statistics books are you planning to get?
Build upon your statistical concepts and successfully step into the world of data science. Analyze your knowledge and choose the most suitable book for your career to enhance your data science skills. If you have any more suggestions for statistics books for data science, please share them with us in the comments below.
Learn how logistic regression fits a dataset to make predictions in R, as well as when and why to use it.
Logistic regression is one of the statistical techniques in machine learning used to build prediction models. It is one of the most popular classification algorithms, mostly used for binary classification problems (problems with two class values, although some variants can handle multiple classes as well). It's used for various research and industrial problems.
Therefore, it is essential to have a good grasp of logistic regression while learning data science. This tutorial is a sneak peek from one of Data Science Dojo's hands-on exercises in their data science Bootcamp program. In it, you will learn how logistic regression fits a dataset to make predictions, as well as when and why to use it.
In short, Logistic Regression is used when the dependent variable(target) is categorical. For example:
To predict whether an email is spam (1) or not spam (0)
Whether the tumor is malignant (1) or not (0)
Intro to logistic regression
It is named 'Logistic Regression' because its underlying technique is quite similar to Linear Regression. However, there are structural differences in how linear and logistic regression operate, which is why linear regression isn't suitable for classification problems.
Its name is derived from one of the core functions behind its implementation called the logistic function or the sigmoid function. It’s an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.
The hypothesis function of logistic regression is shown below, where the function g(z) (the sigmoid) is also given.
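In the usual notation, with z = θᵀx, the sigmoid function and the hypothesis are:

g(z) = 1 / (1 + e^(-z)),   h_θ(x) = g(θᵀx)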
The hypothesis for logistic regression now becomes:
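Substituting g gives the familiar form:

h_θ(x) = 1 / (1 + e^(-θᵀx))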
Here θ (theta) is a vector of parameters that our model will calculate to fit our classifier.
After calculations from the above equations, the cost function is now as follows:
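For m training examples (x^(i), y^(i)), the standard logistic regression (cross-entropy) cost is:

J(θ) = -(1/m) · Σ_{i=1}^{m} [ y^(i) · log(h_θ(x^(i))) + (1 - y^(i)) · log(1 - h_θ(x^(i))) ]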
Here m is the number of training examples. As with linear regression, we will use gradient descent to minimize the cost function and calculate the parameter vector θ (theta).
This tutorial will follow the format below to provide you with hands-on practice with Logistic Regression:
Importing Libraries
Importing Datasets
Exploratory Data Analysis
Feature Engineering
Pre-processing
Model Development
Prediction
Evaluation
The scenario
In this tutorial, we will be working with the Default of Credit Card Clients Data Set. This data set has 30,000 rows and 25 columns. It can be used to estimate the probability of default payment by credit card clients using the data provided. The attributes describe various details about a customer, their past payment information, and bill statements. It is hosted in Data Science Dojo's repository.
Think of yourself as a lead data scientist employed at a large bank. You have been assigned to predict whether a particular customer will default on their payment next month. The result is an extremely valuable piece of information for the bank when making decisions about offering credit to its customers, and it could massively affect the bank's revenue. Therefore, your task is very critical. You will learn to use logistic regression to solve this problem.
The dataset is a tricky one as it has a mix of categorical and continuous variables. Moreover, you will also get a chance to practice these concepts through short assignments given at the end of a few sub-modules. Feel free to change the parameters in the given methods once you have been through the entire notebook.
1) Importing libraries
We'll begin by importing the dependencies that we require. The following dependencies are popularly used for data wrangling operations and visualizations. We would encourage you to have a look at their documentation.
library(knitr)
library(tidyverse)
library(ggplot2)
library(mice)
library(lattice)
library(reshape2)
#install.packages("DataExplorer") if the following package is not available
library(DataExplorer)
2) Importing Datasets
The dataset is available at Data Science Dojo’s repository in the following link. We’ll use the head method to view the first few rows.
## Need to fetch the excel file
path <- "https://code.datasciencedojo.com/datasciencedojo/datasets/raw/master/Default%20of%20Credit%20Card%20Clients/default%20of%20credit%20card%20clients.csv"
data <- read.csv(file = path, header = TRUE)
head(data)
Since the header names are in the first row of the dataset, we’ll use the code below to first assign the headers to be the one from the first row and then delete the first row from the dataset. This way we will get our desired form.
colnames(data) <- as.character(unlist(data[1,]))
data = data[-1, ]
head(data)
To avoid any complications ahead, we’ll rename our target variable “default payment next month” to a name without spaces using the code below.
colnames(data)[colnames(data)=="default payment next month"] <- "default_payment"
head(data)
3) Exploratory data analysis
Data Exploration is one of the most significant portions of the machine-learning process. Clean data can ensure a notable increase in the accuracy of our model. No matter how powerful our model is, it cannot function well unless the data we provide has been thoroughly processed.
This section will walk you through that step and assist you in visualizing your data, finding the relationships between variables, dealing with missing values and outliers, and gaining a fundamental understanding of each variable we'll use. Moreover, this step will also enable us to figure out the most important attributes to feed our model and discard those that have no relevance.
We will start by using the dim function to print out the dimensionality of our data frame.
dim(data)
30000 25
The str method will allow us to know the data type of each variable. We'll transform the variables to a numeric data type since that will be handier to use for our functions ahead.
We have involved an intermediate step by converting our data to character first. We need to use as.character before as.numeric because factors are stored internally as integers with a table to give the factor level labels; using as.numeric directly would only give the internal integer codes.
When applied to a data frame, the summary() function is essentially applied to each column, and the results for all columns are shown together. For a continuous (numeric) variable like "age", it returns a set of descriptive statistics: the minimum, quartiles, median, mean, and maximum.
summary(data)
ID LIMIT_BAL SEX EDUCATION
Min. : 1 Min. : 10000 Min. :1.000 Min. :0.000
1st Qu.: 7501 1st Qu.: 50000 1st Qu.:1.000 1st Qu.:1.000
Median :15000 Median : 140000 Median :2.000 Median :2.000
Mean :15000 Mean : 167484 Mean :1.604 Mean :1.853
3rd Qu.:22500 3rd Qu.: 240000 3rd Qu.:2.000 3rd Qu.:2.000
Max. :30000 Max. :1000000 Max. :2.000 Max. :6.000
MARRIAGE AGE PAY_0 PAY_2
Min. :0.000 Min. :21.00 Min. :-2.0000 Min. :-2.0000
1st Qu.:1.000 1st Qu.:28.00 1st Qu.:-1.0000 1st Qu.:-1.0000
Median :2.000 Median :34.00 Median : 0.0000 Median : 0.0000
Mean :1.552 Mean :35.49 Mean :-0.0167 Mean :-0.1338
3rd Qu.:2.000 3rd Qu.:41.00 3rd Qu.: 0.0000 3rd Qu.: 0.0000
Max. :3.000 Max. :79.00 Max. : 8.0000 Max. : 8.0000
PAY_3 PAY_4 PAY_5 PAY_6
Min. :-2.0000 Min. :-2.0000 Min. :-2.0000 Min. :-2.0000
1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000
Median : 0.0000 Median : 0.0000 Median : 0.0000 Median : 0.0000
Mean :-0.1662 Mean :-0.2207 Mean :-0.2662 Mean :-0.2911
3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000
Max. : 8.0000 Max. : 8.0000 Max. : 8.0000 Max. : 8.0000
BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4
Min. :-165580 Min. :-69777 Min. :-157264 Min. :-170000
1st Qu.: 3559 1st Qu.: 2985 1st Qu.: 2666 1st Qu.: 2327
Median : 22382 Median : 21200 Median : 20089 Median : 19052
Mean : 51223 Mean : 49179 Mean : 47013 Mean : 43263
3rd Qu.: 67091 3rd Qu.: 64006 3rd Qu.: 60165 3rd Qu.: 54506
Max. : 964511 Max. :983931 Max. :1664089 Max. : 891586
BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2
Min. :-81334 Min. :-339603 Min. : 0 Min. : 0
1st Qu.: 1763 1st Qu.: 1256 1st Qu.: 1000 1st Qu.: 833
Median : 18105 Median : 17071 Median : 2100 Median : 2009
Mean : 40311 Mean : 38872 Mean : 5664 Mean : 5921
3rd Qu.: 50191 3rd Qu.: 49198 3rd Qu.: 5006 3rd Qu.: 5000
Max. :927171 Max. : 961664 Max. :873552 Max. :1684259
PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
Min. : 0 Min. : 0 Min. : 0.0 Min. : 0.0
1st Qu.: 390 1st Qu.: 296 1st Qu.: 252.5 1st Qu.: 117.8
Median : 1800 Median : 1500 Median : 1500.0 Median : 1500.0
Mean : 5226 Mean : 4826 Mean : 4799.4 Mean : 5215.5
3rd Qu.: 4505 3rd Qu.: 4013 3rd Qu.: 4031.5 3rd Qu.: 4000.0
Max. :896040 Max. :621000 Max. :426529.0 Max. :528666.0
default_payment
Min. :0.0000
1st Qu.:0.0000
Median :0.0000
Mean :0.2212
3rd Qu.:0.0000
Max. :1.0000
Using the introduce method, we can get basic information about the dataframe, including the number of missing values in each variable.
introduce(data)
As we can observe, there are no missing values in the dataframe.
The information in the summary above gives a sense of the continuous and categorical features in our dataset. However, evaluating these details against the data description shows that categorical variables such as EDUCATION and MARRIAGE have categories beyond those given in the data dictionary. We'll find these extra categories using the count method.
count(data, vars = EDUCATION)
vars      n
0         14
1         10585
2         14030
3         4917
4         123
5         280
6         51
The data dictionary defines the following categories for EDUCATION: “Education (1 = graduate school; 2 = university; 3 = high school; 4 = others)”. However, we can also observe 0 along with numbers greater than 4, i.e. 5 and 6. Since we don’t have any further details about it, we can assume 0 to be someone with no educational experience and 0 along with 5 & 6 can be placed in others along with 4.
count(data, vars = MARRIAGE)
vars      n
0         54
1         13659
2         15964
3         323
The data dictionary defines the following categories for MARRIAGE: “Marital status (1 = married; 2 = single; 3 = others)”. Since category 0 hasn’t been defined anywhere in the data dictionary, we can include it in the ‘others’ category marked as 3.
count(data, vars = MARRIAGE)
count(data, vars = EDUCATION)
vars      n
1         13659
2         15964
3         377

vars      n
1         10585
2         14030
3         4917
4         468
We’ll now move on to a multi-variate analysis of our variables and draw a correlation heat map from the DataExplorer library. The heatmap will enable us to find out the correlation between each variable. We are more interested in finding out the correlation between our predictor attributes with the target attribute default payment next month. The color scheme depicts the strength of correlation between 2 variables.
This is a simple way to quickly find out how much of an impact a variable has on our final outcome. There are other ways to figure this out as well.
plot_correlation(na.omit(data), maxcat = 5L)
We can observe the weak correlation of AGE, BILL_AMT1, BILL_AMT2, BILL_AMT3, BILL_AMT4, BILL_AMT5, and BILL_AMT6 with our target variable.
Now let's have a univariate analysis of our variables. We'll start with the categorical variables and have a quick check on the frequency distribution of their categories. The code below will allow us to observe the required graphs. We'll first draw the distribution for all PAY variables.
plot_histogram(data)
We can make a few observations from the above histogram. The distribution shows that nearly all PAY attributes are right-skewed.
4) Feature engineering
This step can be more important than the actual model used because a machine learning algorithm only learns from the data we give it, and creating features that are relevant to a task is absolutely crucial.
Analyzing our data above, we've noted the extremely weak correlation of some variables with the final target variable. The following have significantly low correlation values: AGE, BILL_AMT2, BILL_AMT3, BILL_AMT4, BILL_AMT5, BILL_AMT6.
5) Pre-processing
Standardization is a transformation that centers the data by removing the mean value of each feature and then scales it by dividing (non-constant) features by their standard deviation. After standardizing the data, the mean will be zero and the standard deviation one.
Standardization is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as linear regression, logistic regression, and linear discriminant analysis. If a feature has a variance that is orders of magnitude larger than the others, it might dominate the objective function and make the estimator unable to learn from the other features correctly.
In the code below, we'll use the scale method to transform our dataset.
data_new[, 1:17] <- scale(data_new[, 1:17])
head(data_new)
The next task is to split the data for training and testing, as we'll use the test data to evaluate our model. We'll reserve 30% of the dataset for testing and use the remaining 70% for training. Because the code below samples row indices at random, the dataset is effectively shuffled before splitting.
#create a list of random number ranging from 1 to number of rows from actual data
#and 70% of the data into training data
data2 = sort(sample(nrow(data_new), nrow(data_new)*.7))
#creating training data set by selecting the output row values
train <- data_new[data2,]
#creating test data set by not selecting the output row values
test <- data_new[-data2,]
Let us print the dimensions of all these variables using the dim method. You can notice the 70-30% split.
dim(train)
dim(test)
21000 18
9000 18
6) Model development
We will now move on to our most important step: developing the logistic regression model. With a few lines of code, we'll fit a logistic regression model using R's glm function, with a binomial family and a logit link, and store it in a variable named log.model.
We'll then train the model on the train data set, which contains 70% of our data. This will be a binary classification model.
## fit a logistic regression model with the training dataset
log.model <- glm(default_payment ~., data = train, family = binomial(link = "logit"))
summary(log.model)
Call:
glm(formula = default_payment ~ ., family = binomial(link = "logit"),
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.1171 -0.6998 -0.5473 -0.2946 3.4915
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.465097 0.019825 -73.900 < 2e-16 ***
LIMIT_BAL -0.083475 0.023905 -3.492 0.000480 ***
SEX -0.082986 0.017717 -4.684 2.81e-06 ***
EDUCATION -0.059851 0.019178 -3.121 0.001803 **
MARRIAGE -0.107322 0.018350 -5.849 4.95e-09 ***
PAY_0 0.661918 0.023605 28.041 < 2e-16 ***
PAY_2 0.069704 0.028842 2.417 0.015660 *
PAY_3 0.090691 0.031982 2.836 0.004573 **
PAY_4 0.074336 0.034612 2.148 0.031738 *
PAY_5 0.018469 0.036430 0.507 0.612178
PAY_6 0.006314 0.030235 0.209 0.834584
BILL_AMT1 -0.123582 0.023558 -5.246 1.56e-07 ***
PAY_AMT1 -0.136745 0.037549 -3.642 0.000271 ***
PAY_AMT2 -0.246634 0.056432 -4.370 1.24e-05 ***
PAY_AMT3 -0.014662 0.028012 -0.523 0.600677
PAY_AMT4 -0.087782 0.031484 -2.788 0.005300 **
PAY_AMT5 -0.084533 0.030917 -2.734 0.006254 **
PAY_AMT6 -0.027355 0.025707 -1.064 0.287277
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 22176 on 20999 degrees of freedom
Residual deviance: 19535 on 20982 degrees of freedom
AIC: 19571
Number of Fisher Scoring iterations: 6
7) Prediction
Below we'll use the predict function to obtain the predictions made by our logistic regression model. We will first print the first 10 rows of our test data set, then store the predicted probabilities in log.predictions and print the values for the corresponding rows so they can be compared with the actual labels in the test set.
test[1:10,]
## to predict using logistic regression model, probablilities obtained
log.predictions <- predict(log.model, test, type="response")
## Look at probability output
head(log.predictions, 10)
 2    0.539623162720197
 7    0.232835137994762
10    0.25988780274953
11    0.0556716133560243
15    0.422481223473459
22    0.165384552048511
25    0.0494775267027534
26    0.238225423596718
31    0.248366972046479
37    0.111907725985513
Below we'll assign class labels using the decision rule that if the predicted probability is greater than 0.5 we assign the label 1, and otherwise 0.
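A minimal version of this step, applied to the probabilities stored in log.predictions above, might look like:
## convert predicted probabilities into class labels using a 0.5 cutoff
log.prediction.rd <- ifelse(log.predictions > 0.5, 1, 0)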
We'll now discuss a few evaluation metrics to measure the performance of our machine learning model. This step is important because it tells us how well the model we just built actually performs on unseen data.
We will output the confusion matrix. It is a handy presentation of the accuracy of a model with two or more classes.
The table presents predictions on the x-axis and accuracy outcomes on the y-axis. The cells of the table are the number of predictions made by a machine learning algorithm.
According to an article, the entries in the confusion matrix have the following meaning in the context of our study:
[[a b]
 [c d]]
a is the number of correct predictions that an instance is negative,
b is the number of incorrect predictions that an instance is positive,
c is the number of incorrect predictions that an instance is negative, and
d is the number of correct predictions that an instance is positive.
table(log.prediction.rd, test[,18])
log.prediction.rd 0 1
0 6832 1517
1 170 481
Finally, we'll write a simple function to print the accuracy below.
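A minimal sketch of such a function, comparing the rounded predictions against the actual labels in the last column of the test set, could be:
## accuracy = share of predictions that match the actual labels
accuracy <- function(predicted, actual) {
  mean(predicted == actual)
}
accuracy(log.prediction.rd, test[, 18])
With the confusion matrix above, this works out to (6832 + 481) / 9000, or roughly 0.81.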
This tutorial has given you a brief and concise overview of the logistic regression algorithm and all the steps involved in achieving better results from our model. It has also highlighted a few methods related to exploratory data analysis, pre-processing, and evaluation; however, there are several other methods that we would encourage you to explore on our blog or in our video tutorials.
If you want to take a deeper dive into several data science techniques, join our 5-day hands-on Data Science Bootcamp, preferred by working professionals, where we cover the following topics:
Fundamentals of Data Mining
Machine Learning Fundamentals
Introduction to R
Introduction to Azure Machine Learning Studio
Data Exploration, Visualization, and Feature Engineering
Decision Tree Learning
Ensemble Methods: Bagging, Boosting, and Random Forest
Every cook knows how to avoid Type I Error: just remove the batteries. Let’s also learn how to reduce the chances of Type II errors.
Why type I and type II errors matter
A/B testing is an essential component of large-scale online services today. So essential, that every online business worth mentioning has been doing it for the last 10 years.
A/B testing is also used in email marketing by all major online retailers. The Obama for America data science team received a lot of press coverage for leveraging data science, especially A/B testing during the presidential campaign.
Hypothesis testing outcome – Data Science Dojo
Here is an interesting article on this topic, along with a data science bootcamp that teaches A/B testing and statistical analysis.
If you have been involved in anything related to A/B testing (online experimentation) on UI, relevance, or email marketing, chances are that you have heard of Type I and Type II errors. The usage of these terms is common, but a good understanding of them is not.
I intend to share two great examples I recently read that will help you remember this especially important concept in hypothesis testing.
Type I error: An alarm without a fire.
Type II error: A fire without an alarm.
Every cook knows how to avoid Type I Error – just remove the batteries. Unfortunately, this increases the incidences of Type II error.
Reducing the chances of Type II error would mean making the alarm hypersensitive, which in turn would increase the chances of Type I error.
Another way to remember this is by recalling the story of the Boy Who Cried Wolf.
Null hypothesis: There is no wolf.
Alternative hypothesis: There is a wolf.
Villagers believing the boy when there was no wolf (rejecting the null hypothesis when it is true): Type I error. Villagers not believing the boy when there was a wolf (failing to reject the null hypothesis when it is false): Type II error.
Tailpiece
The purpose of the post is not to explain type 1 and type 2 error. If this is the first time you are hearing about these terms, here is the Wikipedia entry: Type I and Type II Error.
During political campaigns, political candidates not only need votes but also need financial contributions. Here’s what the campaign finance data set from 2012 looks like.
Understanding individual political contribution by the occupation of top 1% vs bottom 99% in political campaigns
A political candidate not only needs votes, but they also need money. In today's multi-media world, millions of dollars are necessary to run an effective campaign. To win the election battle, citizens will be bombarded with ads that cost millions. Other mounting expenses include wages for staff, consultants, surveyors, grassroots activists, media experts, wonks, and policy analysts. The figures are staggering, with the next presidential election year campaigns likely to cost more than ten billion dollars.
The total cost of US elections from the year 1998 to 2014
Opensecrets.org has summarized the money spent by presidential candidates, Senate and House candidates, political parties, and independent interest groups that played an influential role in the federal elections by cycle. There’s no sign of less spending in future elections.
The 2016 presidential election cycle is already underway, and the fund-raising war has already begun. The Koch brothers' political organization released an $889 million budget in January 2015 supporting conservative campaigns in the 2016 presidential contest. As for primary presidential candidates, the Hillary Clinton campaign aims to raise at least $100 million for the primary election. On the other side of the political aisle, analysts speculated that primary candidate Jeb Bush will raise over $100 million when he discloses his financial position in July.
In my mind I imagine that money coming from millionaires and billionaires or mega-corporations intent on promoting candidates that favor their cause. But who are these people? And how about middle-class citizens like me? Does my paltry $200 amount to anything? Does the spending power of the 99% have any impact on the outcome of an election? Even as a novice I knew I would never understand American politics by listening to TV talking heads or the candidates and their say-nothing ads but by following the money.
By investigating real data about where the stream of money dominating our elections comes from and the role it plays in the success of an election, I hope to find some insight into all the political noise. Thanks to the Federal Election Campaign Act, which requires candidate committees, party committees, and political action committees (PACs) to disclose reports on the money they raise and spend and identify individuals who give more than $200 in an election cycle, a wealth of public data exists to explore. I choose to focus on individual contributions to federal committees greater than $200 for the election cycle 2011-2012.
In the 2012 election cycle, which includes congressional and primary elections, the total amount of individual donations collected was USD 784 million. USD 220 million of that came from the top 1% of donors, making up 28% of the total contribution. These elite wealthy donors were 7,119 individuals, each having donated at least USD 10,000 to federal committees. So, who are the top 1%? What do they do for a living that gives them such financial power to support political committees?
The unique occupation titles in the dataset are simply too numerous to analyze directly. Thus, these occupations were classified into twenty-two occupation groups according to the employment definitions from the Bureau of Statistics. Additional categories were created for titles that did not fit into any of the defined groups. Among them are "Retired," "Unemployed," "Homemaker," and "Politicians."
Immediately from Figure 1, we observe that the "Management" occupation group contributed the highest total amount in the 2012 cycle for Democrats, Republicans, and other parties. Other top donors by occupation group are "Business and Financial Operations," "Retired," "Homemaker," "Politicians," and "Legal." Overall, the Republican Party received more individual contributions from most of the occupation groups, with the noticeable exceptions of "Legal" and "Arts, Design, Entertainment, Sports and Media." The total contribution given to "Other" non-Democratic/Republican parties was abysmal in comparison.
Figure 1: Total contribution of top 1% by occupation group
One might conclude that the reason for the “Management” group being the top donor is obvious given these people are CEOs, CFOs, Presidents, Directors, Managers, and many other management titles in a company. According to the Bureau of Statistics, the “Management” group earned the highest median wages among all other occupation groups. They simply had more to give. The same argument could be applied to the “Business and Financial Operations” group, which is comprised of people who held jobs as investors, business owners, real estate developers, bankers, etc.
Perhaps we could look at the individual contribution by occupation group from another angle. When analyzing the average contribution by occupation group, the “Politicians” group became top of the chart. Individuals belonging to this category are either currently holding public office or they had declared candidacy for office with no other occupation reported. Since there is no limit on how much candidates may contribute to their committee, this group represents rich individuals funding their campaigns.
Figure 2: Average contribution of Top 1% by occupation groups
Suspiciously, the average amount per politician given to Republican committees is dramatically higher than for other parties. Further analysis indicated that the outlier is candidate Jon Huntsman, who donated about USD 5 million to his committee, "Jon Huntsman for President Inc." This inflated the average contribution dramatically. The same phenomenon was also observed in the "Management" group, where the average contribution to the "Other" party was significantly higher compared to the traditional parties.
Out of the five donors who contributed to an independent party from the “Management” group, William Bloomfield alone donated USD 1.3 million (out of the USD 1.45 million total amount collected) to his “Bloomfield for Congress” committee. According to the data, he was the Chairman of Baron Real Estate. This is an example of a wealthy elite spending a hefty sum of money to buy his way into the election race.
Donald Trump, a billionaire business mogul, made headlines recently by declaring his intention to run for president in the 2016 election. He certainly has no trouble paying for his campaign. After excluding the occupation groups "Politicians" and "Management," with the intention of visualizing the comparison among groups more clearly, the contrast became less dramatic. No doubt, the average contribution to Republican committees is consistently higher than to other parties in most of the occupation groups.
Figure 3: Average contribution of Top 1% by occupation group excluding politicians and management group
Could a similar story of the top 1% be told for the bottom 99%? Overall, the top 5 contributors by occupation group are quite similar between the top 1% and the bottom 99%. Once again, the "Management" group collectively contributed the most to the Democratic and Republican Parties. The biggest difference here is that "Politicians" is no longer a top contributor in the bottom 99% demographic.
Figure 4: Total contribution of bottom 99% by Occupation Group
Homemakers consistently rank high in both total and average contributions, in both the top 1% and the bottom 99%. On average, homemakers from the bottom 99% donated about $1,500, while homemakers from the top 1% donated about $30,000 to their chosen political committees. Clearly, across all levels of socioeconomic status, spouses and stay-at-home parents play a key role in the fundraising war. Since the term "Homemaker" is not well defined, I can only assume their source of money comes from a spouse, inherited wealth, or personal savings.
Figure 5: Average contribution of bottom 99% by occupation group
Another observation we could draw from the average contribution from the 99% plot is that “Other” non-Democrats/Republicans Parties depend heavily on the 99% as a source of funding for their political campaigns. Third-party candidates appear to be drawing most of their support from the little guy.
Figure 6: Median wages and median contribution by occupation group
Another interesting question warranting further investigation is whether the amount individuals contribute to political committees is proportional to their income across occupation groups. When we plotted median wages per occupation group side by side with median political contributions, the median donation per group was rather constant while the median income varied significantly across groups. This implies that despite contributing the most overall, as a percentage of their income the wealthiest donors contributed the least.
Campaign finance data: The takeaway
The take-home message from this analysis is that the top 1% wealthy elite seems to be driving the momentum of fundraising for the election campaign. I suspect most of them have full intention to support candidates who would look out for their interest if indeed they got elected. We middle-class citizens may not have the ability to compete financially with these millionaires and billionaires, but our single vote is as powerful as their vote. The best thing we could do as citizens is to educate ourselves on issues that matter to the future of our country.
Ethics in research and A/B testing is essential. A/B testing might not be as simple and harmless as it looks. Learn how to take care of ethical concerns in A/B tests.
The ethical way to A/B testing
We have come a long way since the days of horrific human experiments during World Wars, the Stanford prison experiment, the Guatemalan STD Study, and many more where inhumane treatments were all in the name of science. However, we still have much to learn, with incidents like the clinical trial disaster in France and Facebook’s emotional and psychological experiments of recent years violating the rights of persons and serving as a clear reminder to constantly keep our ethics in research sharply upfront.
As data scientists, we are always experimenting – not only with our models and formulas but also with the responses from our customers. A/B tests or randomized experiments may require human subjects who are willing to undertake a trial or treatment, such as seeing certain content when using a web app or undergoing a certain exercise regime.
Performing A/B test on a website
Facebook example
What may initially seem like a harmless experiment might cause harm or distress. For example, Facebook's experiment of provoking negative emotions in some users and positive emotions in others could have grave consequences. If a user who was experiencing emotional distress happened to see content that provoked negative feelings, it could spur on a tragic event such as physical harm. Carefully understanding our experiments and our test subjects before implementing our research, products, or services may prevent inappropriate testing. Consent is the best tool available to data scientists working with data generated by people. Similar to the guidelines for clinical trials, it is informed consent specifically that is needed to avoid potential unintended consequences of experiments.
If an organization specializing in exercise science accepted participation from a person who has a high risk of heart failure and did not ask for a medical examination before conducting the experiment, then the organization is potentially liable for the consequences.
Often a simple, harmless A/B test might not be as simple and harmless as it looks. So how do we ensure we are not putting our human subjects’ well-being and safety in danger when we conduct our research and experiments?
First steps in research
The first port of call is using informed user consent. This doesn’t mean pages and pages of legal jargon on sign up or being vague in an email when reaching out for volunteers for your study. This could rather be a popup window or email that is clear on the purpose of the experiment and any warnings or potential risks the person needs to be aware of. Depending on how intense the treatment is, a medical or psychological examination is a good idea to ensure that the participant can cope with the given treatment. Being unaware of people’s vulnerabilities can lead to unintended consequences. This can be avoided through clearer warnings, or the next level up which may be online assessments, or even expert examinations.
The next step in ensuring your A/B test or experiment runs smoothly and ethically is making sure you understand local and federal regulations around conducting research experiments on humans. In the US, these regulations mainly look at:
Informed consent, with a full explanation of any potential risks to the subject.
Providing additional safeguards for vulnerable populations such as children, mentally disabled people, mentally ill people, economically disadvantaged people, pregnant women, and so on.
Government-funded experiments need the approval of an Institutional Review Board or an independent ethics committee before conducting experiments.
During the A/B test or experiment, it's also a good idea to regularly check in and see how your subjects are responding to the treatments, not only for the purpose of scientific research but also to quickly solve any health or well-being issues. This could be in the form of a short popup survey or email to check that the user is safe and well, or face-to-face consulting. Also, having an opt-out option allows the subject to take control if they feel their health or well-being is at risk. Having some people opt out might seem inconvenient for your study, but a serious or tragic incident as a result of a participant having to go through the full course of the treatment is a far worse outcome.
Observational studies might be a good alternative if the above steps are in no way feasible for your experiment. Observational studies are limited when drawing conclusions, and only real experiments allow you to make confident conclusions from the data. However, in some situations, it is neither possible nor ethical to force treatments onto subjects. For example, it's not ethical to inject cancer cells into random subjects, but you can study cancer patients with the inherited attributes you are looking for to help with your research.
The ethical takeaway
It is understood that there can be some overhead in carefully preparing, setting up, and following ethical guidelines for an experiment or A/B test. However, the serious consequences of not doing it properly, as well as public distrust, will only lead to a reluctance to share data, hindering our ability to do our work effectively.
If you’re curious to learn more about A/B testing, watch the short video below.
Just like humans, algorithms can develop bias and make skewed decisions. What are these biases and how do they impact decision-making?
An algorithmic bias in the making
If we took a hard look at every model ever built for classifying who is the optimal candidate for:
A credit loan
A job promotion
A free scholarship or
Any other opportunity,
would we see a pattern in certain groups of people being granted these opportunities over others? Are our algorithms and formulas biased?
Understanding the problem
Would we see these models repeatedly make decisions about who should be the part of the “have” and “have not” groups? Further, do these models truly pick the optimal candidate? Instead, might they pick according to what someone personally thinks is the optimal candidate?
Research groups like AI Now have recently launched initiatives to fight algorithmic bias, bringing these issues to light. It's crucial that we as data scientists keep our algorithms in check, to avoid developing yet another tool that is used to discriminate against people.
So how can we keep our algorithms in check?
In recent years, researchers have come up with ways to detect if a model is biased in its decisions about people. A 2016 paper called Equality of Opportunity in Supervised Learning proposes a framework.
This framework uses “equalized odds and equal opportunity” as a criterion for assessing a model’s fairness when classifying people. This criterion allows features to predict an outcome or class (such as predicting “high credit risk applicant”).
Importantly, it prohibits abusing a particular attribute of a person (such as race) to do this. The model must be equally accurate in all demographics. Consequently, it is punished if it only performs well on the majority of people. This means that the predicted outcome must have equal true positive/negative rates and false positive/negative rates across all demographics.
The framework is conducted as a post-learning step. Therefore, it doesn’t require modifying the algorithm or model itself. Then it assesses whether the results from a model seem skewed towards a group of people. For example, a flawed model is one that makes it harder for African Americans who do pay back their loans to apply for loans.
This model also makes it easier for Caucasians who don't pay back their loans to apply for loans. The framework ensures that such a model would be judged unfair, as it would not result in equal false positive/negative rates for both African Americans and Caucasians.
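As a rough sketch of how such a check might look in practice (the data frame loan_data and its columns group, actual, and predicted are hypothetical), one could compare error rates across demographic groups:
# hypothetical data: one row per applicant, with the demographic group,
# the actual repayment outcome (1 = repaid) and the model's prediction (1 = approved)
library(dplyr)
rates_by_group <- loan_data %>%
  group_by(group) %>%
  summarise(
    true_positive_rate  = sum(predicted == 1 & actual == 1) / sum(actual == 1),
    false_positive_rate = sum(predicted == 1 & actual == 0) / sum(actual == 0)
  )
rates_by_group
Large gaps in these rates between groups would indicate that the model violates equalized odds.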
The framework also overcomes the problem of loss of utility when using demographic parity. This requires a predicted outcome to be independent of a particular sensitive attribute. Using the framework, the predicted outcome is allowed to depend on a particular attribute, but only through the actual outcome. This prevents the attribute from being a proxy to the actual outcome while avoiding loss of utility.
Predictor variables and skewed data
Another framework for detecting algorithmic bias is testing how different predictor variables or attributes might skew the predicted outcome. A 2017 paper called Counterfactual Fairness shows how different variables influenced the results of the 2014 stop-and-frisk New York City police initiative.
The data showed that the police officers mostly stopped and frisked African Americans and Hispanics. This happened despite most of those people being innocent or not as suspicious as predicted.
In fact, actual incidents of crime were similar across all races. When all predictor variables, including the race attribute, were considered, the model learned to correlate race with the criminality outcome. By instead using only variables directly related to a person's criminality, the researchers were able to spot criminals more accurately than a model highly dependent on race and appearance would.
The Takeaway
First, this research shows that relying on race as a predictor leads to a skewed outcome. Second, it also shows how ineffective the police would be by allowing such bias to be at the core of their decisions.
Visualizations of the predicted versus the actual data show how some locations with a high number of arrests could be completely missed if they were to depend on race. How we construct our models and the variables we use can truly affect people’s opportunities, livelihood, and overall well-being. Therefore, this must be handled ethically and responsibly.
As data scientists, our philosophy should be built on the pursuit of truth, not the manipulation of models to find the most convenient or profitable results at all costs, even at the cost of our ethics.
It is important that we include bias assessments as part of the process, so we can be more confident that our models are designed to better our understanding of people and make smarter decisions, not dumb and discriminatory decisions.
Maybe your boss isn't Bill Lumbergh, but if his understanding of analytics is limited to green and red (ad hoc), chances are you've rolled your eyes more than once.
Hello Peter, what’s happening? Ummm, I’m going to need you to go ahead and come in tomorrow to build that report. So, if you could be here around 9 that would be great, mmmk… oh oh! and I almost forgot ahh, I’m also going to need you to go ahead and come in on Sunday too, kay? We ahh don’t understand why our sales dropped this week and ah, we need to play catch up and analyze it.
Honestly, how many times has this happened to you? Maybe your boss isn't a Bill Lumbergh, but if his understanding of analytics is limited to green = good and red = bad, chances are you've rolled your eyes in disgust more than once.
It happens a lot with Ad hoc analysis
And you're not alone. Ad hoc analytics requests can make up 50% of an analytics team's time. So what is a pragmatic analyst to do? According to Phil Kemelor, one strategy is to adopt a "don't say yes" approach. Before committing to a request, ask yourself:
What is the business reason behind the request?
Will this kind of analysis answer the business question?
Do we know how long this will take?
Can we fit this into our funnel?
In other words, having an intake process to prioritize analytics requests can save teams a lot of weekend work.
When deciding whether to commit or pushback on a request, it’s also important to remember that your efforts will be wasted if the analysis cannot be acted upon. In other words, what will the business unit be able to do when they get access to your analysis? Will the organization be able to make changes to improve a bad situation? Nothing is more frustrating than spending a weekend building a report that subsequently gets printed out and put on a shelf to collect dust.
1. Knowing the difference between driving the car and fixing the engine
Brent Dykes also emphasizes the importance of understanding the difference between reporting and analysis. Reporting is the process of organizing data to monitor performance; analysis is the process of exploring data and reports to extract insights.
The former helps a company ensure that everything is running well; the latter is an investigative tool used to figure out what's going on "underneath the hood." Organizations that don't understand the difference between the two are more likely to make ad hoc requests.
2. Doesn’t self-serve solve this problem?
What about self-service tools? After all, if any employee has the potential to become a citizen data scientist, then the demand for ad hoc requests should drop, right? Perhaps, but the costs to the organization might outweigh the benefits. Literally.
The ad hoc reporting promise fails when ad hoc reports:
Are treated like official reports shared broadly across the organization
Perform shallow analysis that lacks real insight
Are subject to the author's own confirmation bias
3. Take a stand, for the right reason
Not everyone has the luxury of saying no to their Bill Lumbergh. But as a recognized data expert in your organization, you do have something much more powerful – credibility.
In the long run, this means that you have the ability to shape your company’s data strategy, and ultimately wean the business off random ad hoc analysis. Start flexing those muscles today.
Statistical distributions help us understand a problem better by assigning a range of possible values to the variables, making them very useful in data science and machine learning. Here are 7 types of distributions with intuitive examples that often occur in real-life data.
Types of Probability Distributions in ML Infographic
Whether you’re guessing if it’s going to rain tomorrow, betting on a sports team to win an away match, framing a policy for an insurance company, or simply trying your luck on blackjack at the casino, probability and distributions come into action in all aspects of life to determine the likelihood of events.
Having a sound statistical background can be incredibly beneficial in the daily life of a data scientist. Probability is one of the main building blocks of data science and machine learning. While the concept of probability gives us mathematical calculations, statistical distributions help us visualize what’s happening underneath.
Having a good grip on statistical distribution makes exploring a new dataset and finding patterns within a lot easier. It helps us choose the appropriate machine learning model to fit our data on and speeds up the overall process.
PRO TIP: Join our data science bootcamp program today to enhance your data science skillset!
In this blog, we will be going over diverse types of data, the common distributions for each of them, and compelling examples of where they are applied in real life.
Before we proceed further, if you want to learn more about probability distribution, watch this video below:
Common types of data
Explaining various distributions becomes more manageable if we are familiar with the type of data they use. We encounter two different outcomes in day-to-day experiments: finite and infinite outcomes.
Difference between Discrete and Continuous Data (Source)
When you roll a die or pick a card from a deck, you have a limited number of outcomes possible. This type of data is called Discrete Data, which can only take a specified number of values. For example, in rolling a die, the specified values are 1, 2, 3, 4, 5, and 6.
Similarly, we can see examples of infinite outcomes in our daily environment. Recording time or measuring a person's height can take infinitely many values within a given interval. This type of data is called Continuous Data, which can have any value within a given range. That range can be finite or infinite.
For example, suppose you measure a watermelon's weight. It can be any value, such as 10.2 kg, 10.24 kg, or 10.243 kg, making it measurable but not countable, hence continuous. On the other hand, suppose you count the number of boys in a class; since the value is countable, it is discrete.
Types of statistical distributions
Depending on the type of data we use, we have grouped distributions into two categories, discrete distributions for discrete data (finite outcomes) and continuous distributions for continuous data (infinite outcomes).
Discrete distributions
Discrete uniform distribution: All outcomes are equally likely
In statistics, uniform distribution refers to a statistical distribution in which all outcomes are equally likely. Consider rolling a six-sided die. You have an equal probability of obtaining all six numbers on your next roll, i.e., obtaining precisely one of 1, 2, 3, 4, 5, or 6, equaling a probability of 1/6, hence an example of a discrete uniform distribution.
As a result, the uniform distribution graph contains bars of equal height representing each outcome. In our example, the height is a probability of 1/6 (0.166667).
Fair Dice Uniform Distribution Graph
Uniform distribution is represented by the function U(a, b), where a and b represent the starting and ending values, respectively. Similar to a discrete uniform distribution, there is a continuous uniform distribution for continuous variables.
The drawback of this distribution is that it often provides us with little relevant information. Using our example of rolling a die, we get an expected value of 3.5, which gives us no useful intuition since there is no such thing as half a number on a die. Since all values are equally likely, it gives us no real predictive power.
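A quick simulation in R illustrates the idea; sample is base R, so nothing beyond the die example above is assumed:
# simulate 10,000 rolls of a fair six-sided die
rolls <- sample(1:6, size = 10000, replace = TRUE)
prop.table(table(rolls))  # each outcome lands near 1/6
mean(rolls)               # close to the expected value of 3.5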
Bernoulli Distribution: Single-trial with two possible outcomes
The Bernoulli distribution is one of the easiest distributions to understand. It can be used as a starting point to derive more complex distributions. Any event with a single trial and only two outcomes follows a Bernoulli distribution. Flipping a coin or choosing between True and False in a quiz are examples of a Bernoulli distribution.
They have a single trial and only two outcomes. Let's assume you flip a coin once; this is a single trial. The only two outcomes are heads or tails. This is an example of a Bernoulli distribution.
Usually, when following a Bernoulli distribution, we have the probability of one of the outcomes (p). From (p), we can deduce the probability of the other outcome by subtracting it from the total probability (1), represented as (1-p).
It is represented by bern(p), where p is the probability of success. The expected value of a Bernoulli trial ‘x’ is represented as, E(x) = p, and similarly Bernoulli variance is, Var(x) = p(1-p).
Loaded Coin Bernoulli Distribution Graph
The graph of a Bernoulli distribution is simple to read. It consists of only two bars, one rising to the associated probability p and the other growing to 1-p.
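A small sketch in R, using a hypothetical loaded coin with a 0.6 probability of heads, shows how Bernoulli trials can be simulated as binomial trials of size 1:
p <- 0.6                        # hypothetical probability of success (heads)
rbinom(10, size = 1, prob = p)  # ten independent Bernoulli trials, each returning 1 or 0
p                               # expected value, E(x) = p
p * (1 - p)                     # variance, Var(x) = p(1 - p)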
Binomial Distribution: A sequence of Bernoulli events
The Binomial Distribution can be thought of as the sum of outcomes of an event following a Bernoulli distribution. Therefore, Binomial Distribution is used in binary outcome events, and the probability of success and failure is the same in all successive trials. An example of a binomial event would be flipping a coin multiple times to count the number of heads and tails.
Binomial vs Bernoulli distribution.
The difference between these distributions can be explained through an example. Consider you’re attempting a quiz that contains 10 True/False questions. Trying a single T/F question would be considered a Bernoulli trial, whereas attempting the entire quiz of 10 T/F questions would be categorized as a Binomial trial. The main characteristics of Binomial Distribution are:
Given multiple trials, each of them is independent of the other. That is, the outcome of one trial doesn’t affect another one.
Each trial can lead to just two possible results (e.g., winning or losing), with probabilities p and (1 – p).
A binomial distribution is represented by B (n, p), where n is the number of trials and p is the probability of success in a single trial. A Bernoulli distribution can be shaped as a binomial trial as B (1, p) since it has only one trial. The expected value of a binomial trial “x” is the number of times a success occurs, represented as E(x) = np. Similarly, variance is represented as Var(x) = np(1-p).
Given the probability of success (p) and the number of trials (n), we can calculate the probability of exactly x successes in these n trials using the formula below:
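P(X = x) = C(n, x) p^x (1 − p)^(n − x), where C(n, x) = n! / (x! (n − x)!) counts the ways to choose which x of the n trials are successes.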
For example, suppose that a candy company produces both milk chocolate and dark chocolate candy bars. The total products contain half milk chocolate bars and half dark chocolate bars. Say you choose ten candy bars at random and choosing milk chocolate is defined as a success. The probability distribution of the number of successes during these ten trials with p = 0.5 is shown here in the binomial distribution graph:
Binomial Distribution Graph
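These probabilities can be reproduced in R with dbinom, which is base R; the only assumption is the p = 0.5, n = 10 setup described above:
dbinom(6, size = 10, prob = 0.5)     # probability of picking exactly 6 milk chocolate bars
dbinom(0:10, size = 10, prob = 0.5)  # the full distribution of 0 through 10 successes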
Poisson Distribution: The number of times an event occurs in an interval
Poisson distribution deals with the frequency with which an event occurs within a specific interval. Instead of the probability of an event, Poisson distribution requires knowing how often it happens in a particular period or distance. For example, a cricket chirps two times in 7 seconds on average. We can use the Poisson distribution to determine the likelihood of it chirping five times in 15 seconds.
A Poisson process is represented with the notation Po(λ), where λ represents the expected number of events that can take place in a period. The expected value and the variance of a Poisson process are both λ. X represents the discrete random variable. A Poisson distribution can be modeled using the following formula:
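For k observed events, the formula is P(X = k) = (λ^k e^(−λ)) / k!.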
The main characteristics which describe the Poisson Processes are:
The events are independent of each other.
An event can occur any number of times (within the defined period).
Two events can’t take place simultaneously.
Poisson Distribution Graph
The graph of a Poisson distribution plots the number of times an event occurs in the given interval against the probability of each count.
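The cricket example above can be worked out in R with dpois; the only assumption is scaling the rate of 2 chirps per 7 seconds to a 15-second window:
lambda_15s <- 2 * 15 / 7           # expected number of chirps in 15 seconds, about 4.29
dpois(5, lambda = lambda_15s)      # probability of exactly five chirps in 15 seconds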
Continuous distributions
Normal Distribution: Symmetric distribution of values around the mean
Normal distribution is the most used distribution in data science. In a normal distribution graph, data is symmetrically distributed with no skew. When plotted, the data follows a bell shape, with most values clustering around a central region and tapering off as they go further away from the center.
The normal distribution frequently appears in nature and life in various forms. For example, the scores of a quiz follow a normal distribution. Many of the students scored between 60 and 80 as illustrated in the graph below. Of course, students with scores that fall outside this range are deviating from the center.
Normal Distribution Bell Curve Graph
Here, you can see the "bell-shaped" curve around the central region, indicating that most data points exist there. The normal distribution is represented as N(µ, σ²), where µ represents the mean and σ² represents the variance. The expected value of a normal distribution is equal to its mean. Some of the characteristics which can help us recognize a normal distribution are:
The curve is symmetric about the center. Therefore the mean, mode, and median are equal, distributing all the values symmetrically around the mean.
The area under the distribution curve equals 1 (all the probabilities must sum up to 1).
68-95-99.7 Rule
While plotting a graph for a normal distribution, 68% of all values lie within one standard deviation of the mean. In the example above, if the mean is 70 and the standard deviation is 10, 68% of the values will lie between 60 and 80. Similarly, 95% of the values lie within two standard deviations of the mean, and 99.7% lie within three standard deviations. This last interval captures almost all of the data; if a data point falls outside it, it is most likely an outlier.
Probability Density and 68-95-99.7 Rule
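These percentages are easy to verify in R with pnorm, using the quiz example above with mean 70 and standard deviation 10:
pnorm(80, mean = 70, sd = 10) - pnorm(60, mean = 70, sd = 10)   # about 0.68, within one standard deviation
pnorm(90, mean = 70, sd = 10) - pnorm(50, mean = 70, sd = 10)   # about 0.95, within two standard deviations
pnorm(100, mean = 70, sd = 10) - pnorm(40, mean = 70, sd = 10)  # about 0.997, within three standard deviations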
Student t-Test Distribution: Small sample size approximation of a normal distribution
The Student's t-distribution, also known as the t distribution, is a type of statistical distribution similar to the normal distribution in its bell shape, but with heavier tails. The t distribution is used instead of the normal distribution when you have small sample sizes.
Student t-Test Distribution Curve
For example, suppose we deal with the total apples sold by a shopkeeper in a month. In that case, we will use the normal distribution. Whereas, if we are dealing with the total amount of apples sold in a day, i.e., a smaller sample, we can use the t distribution.
Another critical difference between the Student's t distribution and the normal one is that, apart from the mean and variance, we must also define the degrees of freedom for the distribution. In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. A Student's t distribution is represented as t(k), where k represents the number of degrees of freedom. For k = 2, i.e., 2 degrees of freedom, the expected value is the same as the mean.
T-Distribution Table
Degrees of freedom are in the left column of the t-distribution table.
Overall, the student t distribution is frequently used when conducting statistical analysis and plays a significant role in performing hypothesis testing with limited data.
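The heavier tails show up directly in the quantiles; a quick comparison in R, with 5 degrees of freedom chosen purely for illustration:
qnorm(0.975)        # about 1.96 for the normal distribution
qt(0.975, df = 5)   # about 2.57 for a t distribution with 5 degrees of freedom
qt(0.975, df = 30)  # about 2.04; with more degrees of freedom the t distribution approaches the normal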
Exponential distribution: Model elapsed time between two events
Exponential distribution is one of the widely used continuous distributions. It is used to model the time taken between different events. For example, in physics, it is often used to measure radioactive decay; in engineering, to measure the time associated with receiving a defective part on an assembly line; and in finance, to measure the likelihood of the next default for a portfolio of financial assets. Another common application of exponential distributions is in survival analysis (e.g., the expected life of a device or machine).
The exponential distribution is commonly represented as Exp(λ), where λ is the distribution parameter, often called the rate parameter. We can find λ from the mean using λ = 1/μ, where μ is the mean. For an exponential distribution, the standard deviation is equal to the mean, and the variance is Var(x) = 1/λ².
Exponential Distribution Curve
An exponential graph is a curved line representing how the probability changes exponentially. Exponential distributions are commonly used in calculations of product reliability or the length of time a product lasts.
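As a small illustration, suppose a machine part fails on average once every 10 hours (a purely hypothetical figure), so μ = 10 and λ = 1/10:
lambda <- 1 / 10               # rate parameter from the mean time between failures
pexp(5, rate = lambda)         # probability of a failure within the next 5 hours
1 - pexp(20, rate = lambda)    # probability the part survives beyond 20 hours
1 / lambda^2                   # variance, Var(x) = 1/λ²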
Conclusion
Understanding data distributions is an essential part of the data exploration and model development process. The first thing that springs to mind when working with continuous variables is looking at the data distribution. We can adjust our machine learning models to best match the problem if we can identify the pattern in the data distribution, which reduces the time needed to get to an accurate outcome.
Indeed, specific machine learning models are built to perform best when certain distribution assumptions are met. Knowing which distributions we're dealing with may thus assist us in determining which models to apply.