Get ahead in data analysis with our summary of the top 7 must-know statistical techniques. Master these tools for better insights and results.
While the field of statistical inference is fascinating, many people have a tough time grasping its subtleties. For example, some may not be aware that there are multiple types of inference, each suited to a different situation. Moreover, the applications of inference are equally diverse.
For example, when it comes to assessing the credibility of a witness, we need to know how reliable the person is and how likely it is that the person is lying. Similarly, when it comes to making predictions about the future, it is important to factor in not just the accuracy of the forecast but also whether it is credible.
Top statistical techniques – Data Science Dojo
Counterfactual causal inference:
Counterfactual causal inference is a statistical technique that is used to evaluate the causal significance of historical events. Exploring how historical events may have unfolded under small changes in circumstances allows us to assess the importance of factors that may have caused the event. This technique can be used in a wide range of fields such as economics, history, and social sciences. There are multiple ways of doing counterfactual inference, such as Bayesian Structural Modelling.
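As a toy illustration (not the Bayesian structural approach itself), the sketch below estimates an event's effect on a synthetic series by predicting the counterfactual, i.e. what would have happened without the event, from an unaffected control series. All names and numbers here are invented.

```python
# Minimal counterfactual sketch on synthetic data: learn the relationship between
# a control series and a treated series before the event, predict the "no event"
# counterfactual afterwards, and compare it to what actually happened.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
t = np.arange(100)
control = 10 + 0.5 * t + rng.normal(0, 1, 100)   # series unaffected by the event
treated = 5 + 0.5 * t + rng.normal(0, 1, 100)
treated[70:] += 8                                # the "event" lifts the series at t = 70

# Fit the control-to-treated relationship on the pre-event period only
model = LinearRegression().fit(control[:70].reshape(-1, 1), treated[:70])

# Counterfactual: what the treated series would have looked like without the event
counterfactual = model.predict(control[70:].reshape(-1, 1))
print("estimated effect:", (treated[70:] - counterfactual).mean())
```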
Overparametrized models and regularization:
Overparametrized models are models that have more parameters than the number of observations. These models are prone to overfitting and are not generalizable to new data. Regularization is a technique that is used to combat overfitting in overparametrized models. Regularization adds a penalty term to the loss function to discourage the model from fitting the noise in the data. Two common types of regularization are L1 and L2 regularization.
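Here is a hedged sketch of both penalties using scikit-learn, on synthetic data with more features than observations; the alpha values are arbitrary choices for illustration.

```python
# L2 (ridge) and L1 (lasso) regularization on an overparametrized problem
# (more features than observations). All data is synthetic.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(42)
n_samples, n_features = 50, 200                  # p > n: overparametrized
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:5] = [3.0, -2.0, 1.5, 4.0, -1.0]      # only 5 features actually matter
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

ridge = Ridge(alpha=1.0).fit(X, y)               # L2: shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)               # L1: drives many coefficients exactly to zero

print("non-zero ridge coefficients:", np.sum(np.abs(ridge.coef_) > 1e-6))
print("non-zero lasso coefficients:", np.sum(np.abs(lasso.coef_) > 1e-6))
```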
Generic computation algorithms:
Generic computation algorithms are a set of algorithms that can be applied to a wide range of problems. These algorithms are often used to solve optimization problems, such as gradient descent and conjugate gradient. They are also used in machine learning, such as support vector machines and k-means clustering.
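As a minimal example of one such algorithm, here is plain gradient descent minimizing a simple two-variable function; the function, starting point, and step size are arbitrary choices for illustration.

```python
# Bare-bones gradient descent minimizing f(x, y) = (x - 3)^2 + (y + 1)^2
import numpy as np

def grad(v):
    x, y = v
    return np.array([2 * (x - 3), 2 * (y + 1)])   # analytic gradient of f

v = np.array([0.0, 0.0])      # starting point
learning_rate = 0.1
for _ in range(200):
    v = v - learning_rate * grad(v)

print(v)   # approaches the minimum at (3, -1)
```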
Robust inference:
Robust inference is a technique that is used to make inferences that are not sensitive to outliers or extreme observations. This technique is often used in cases where the data is contaminated with errors or outliers. There are several robust statistical methods such as the median and the Huber M-estimator.
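A small sketch comparing an ordinary least squares fit with a Huber M-estimator on synthetic data contaminated by a few outliers; scikit-learn's HuberRegressor stands in for the general idea.

```python
# A Huber M-estimator is far less sensitive to outliers than ordinary least squares.
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 0.5, 100)
y[:5] += 50                                    # five gross outliers

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

print("OLS slope:  ", ols.coef_[0])            # pulled toward the outliers
print("Huber slope:", huber.coef_[0])          # stays close to the true slope of 2
```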
Bootstrapping and simulation-based inference:
Bootstrapping and simulation-based inference are techniques that are used to estimate the precision of sample statistics and to evaluate and compare models. Bootstrapping is a resampling technique that is used to estimate the sampling distribution of a statistic by resampling the data with replacement.
Simulation-based inference is a method that is used to estimate the sampling distribution of a statistic by generating many simulated samples from the model.
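For instance, here is a minimal bootstrap sketch for the median of a synthetic sample; the 5,000 resamples and the exponential data are arbitrary choices.

```python
# Bootstrap: estimate the standard error and a 95% confidence interval for the
# median by resampling the data with replacement.
import numpy as np

rng = np.random.default_rng(7)
data = rng.exponential(scale=2.0, size=200)    # synthetic skewed sample

boot_medians = np.array([
    np.median(rng.choice(data, size=data.size, replace=True))
    for _ in range(5000)
])

print("bootstrap standard error:", boot_medians.std())
print("95% CI:", np.percentile(boot_medians, [2.5, 97.5]))
```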
Multilevel models:
Multilevel models are a class of models that are used to account for the hierarchical structure of data. These models are often used in fields such as education, sociology, and epidemiology. They are also known as hierarchical linear models, mixed-effects models, or random coefficient models.
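A hedged sketch of a random-intercept model with statsmodels, using a made-up students-within-schools dataset; all column names and numbers are invented.

```python
# Random-intercept multilevel model: students nested within schools.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n_schools, n_students = 20, 30
school = np.repeat(np.arange(n_schools), n_students)
school_effect = rng.normal(0, 2, n_schools)[school]       # school-level random intercepts
hours = rng.uniform(0, 10, n_schools * n_students)
score = 50 + 3 * hours + school_effect + rng.normal(0, 5, n_schools * n_students)

df = pd.DataFrame({"score": score, "hours": hours, "school": school})

# Random intercept for each school, fixed slope for study hours
model = smf.mixedlm("score ~ hours", data=df, groups=df["school"]).fit()
print(model.summary())
```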
Adaptive decision analysis:
Adaptive decision analysis is a statistical technique that is used to make decisions under uncertainty. It involves modeling the decision problem, simulating possible outcomes, and updating the decision as new information arrives. This method is often used in fields such as finance, engineering, and healthcare.
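As a toy sketch of the idea (not a full decision-analysis framework), the loop below maintains a Beta belief about a success probability, updates it after each observed outcome, and re-evaluates whether continuing is worthwhile. Every number here is invented.

```python
# Adaptive decision-making under uncertainty: update a Beta belief as outcomes
# arrive and decide at each step whether the expected payoff justifies continuing.
import numpy as np

rng = np.random.default_rng(11)
true_success_rate = 0.65
alpha, beta = 1.0, 1.0            # uniform Beta(1, 1) prior on the success probability
payoff_success, cost = 10.0, 4.0  # payoff per success, cost per attempt

for step in range(20):
    expected_rate = alpha / (alpha + beta)
    expected_value = expected_rate * payoff_success - cost
    if expected_value < 0:
        print(f"step {step}: stop, expected value {expected_value:.2f} < 0")
        break
    outcome = rng.random() < true_success_rate   # observe a new outcome
    alpha += outcome                              # update the belief
    beta += 1 - outcome
    print(f"step {step}: belief = {expected_rate:.2f}, continue")
```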
Which statistical techniques do you use most?
This article discusses most of the statistical methods that are used in quantitative fields. These are often used to infer causal relationships between variables.
A common goal of statistical analysis is to infer causal relationships from observational data. This is usually difficult for two reasons. First, observational data may be noisy and contaminated by errors. Second, variables are often correlated. To correctly infer causality, it is necessary to model these correlations and to account for biases and confounding factors.
Because statistical techniques are usually implemented through specific software packages, the implementation of each method often differs from package to package. It is worth consulting the original papers and the documentation of the packages you use, along with the best practices associated with each technique.
In this blog, we will introduce you to the highly rated data science statistics books on Amazon. As you read the blog, you will find 5 books for beginners and 5 books for advanced-level experts. We will discuss what’s covered in each book and how it helps you to scale up your data science career.
5 Advanced statistics books for data science
1. Naked Statistics: Stripping the Dread from the Data – By Charles Wheelan
The book unfolds the underlying impact of statistics on our everyday life. It walks the readers through the power of data behind the news.
Mr. Wheelan begins the book with the classic Monty Hall problem. It is a famous, seemingly paradoxical problem using Bayes’ theorem in conditional probability. Moving on, the book separates the important ideas from the arcane technical details that can get in the way. The second part of the book interprets the role of descriptive statistics in crafting a meaningful summary of the underlying phenomenon of data.
Wheelan highlights the Gini index, which summarizes a nation's income distribution and is most often used to measure inequality. The later part of the book clarifies key concepts such as correlation, inference, and regression analysis, explaining how data can be manipulated to tackle thorny questions. Wheelan's concluding chapter focuses on the contribution statistics can continue to make to the world's most pressing problems, rather than offering a more reflective assessment of its strengths and weaknesses.
2. Bayesian Methods For Hackers – Probabilistic Programming and Bayesian Inference, By Cameron Davidson-Pilon
We mostly learn Bayesian inference through intensely complex mathematical analyses supported by artificial examples. This book instead approaches Bayesian inference through probabilistic programming with the powerful PyMC library and the closely related Python tools NumPy, SciPy, and Matplotlib.
Davidson-Pilon focuses on improving learners' understanding of the motivations, applications, and challenges in Bayesian statistics and probabilistic programming. The book brings a much-needed introduction to Bayesian methods targeted at practitioners, so you will get the most out of it if you already have a sound understanding of statistics. Knowing about prior and posterior probabilities will give the reader an added advantage in building and training their first Bayesian model.
The second part of the book introduces the probabilistic programming library for Python through a series of detailed examples and intuitive explanations. With recent core developments and the growing popularity of the scientific Python stack, PyMC is likely to become a core component soon enough. PyMC does have dependencies to run, namely NumPy and (optionally) SciPy. To avoid limiting the reader, the examples rely only on PyMC, NumPy, SciPy, and Matplotlib. The book is filled with examples, figures, and Python code that make it easy to get started solving actual problems.
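To give a flavor of the workflow the book teaches (this snippet is a generic illustration, not an example from the book), here is a minimal PyMC model that infers a coin's bias from made-up flips.

```python
# Minimal PyMC example: infer a coin's bias from observed flips.
import numpy as np
import pymc as pm

flips = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])   # made-up data: 1 = heads

with pm.Model():
    p = pm.Beta("p", alpha=1, beta=1)               # prior belief about the bias
    pm.Bernoulli("obs", p=p, observed=flips)        # likelihood of the observed flips
    idata = pm.sample(1000, chains=2, progressbar=False)

print(idata.posterior["p"].mean().item())           # posterior mean of the bias
```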
3. Practical Statistics for Data Scientists – By Peter Bruce and Andrew Bruce
This book is most beneficial for readers who have some basic understanding of the R programming language and statistics.
The authors cover the concepts needed to teach practical statistics for data science: data structures, datasets, random sampling, regression, descriptive statistics, probability, statistical experiments, and machine learning. The example code is available in both Python and R.
The book frames the first step in any data science project as exploring the data. Exploratory data analysis is a comparatively new area of statistics; classical statistics focused almost exclusively on inference, a sometimes-complex set of procedures for drawing conclusions about large populations based on small samples.
To apply the statistical concepts covered in this book, unstructured raw data must be processed and manipulated into a structured form—as it might emerge from a relational database—or be collected for a study.
4. Advanced Engineering Mathematics by Erwin Kreyszig
Advanced Engineering Mathematics is a textbook for advanced engineering and applied mathematics students. It covers ordinary and partial differential equations, linear algebra, vector and tensor calculus, complex analysis, numerical methods, optimization, and probability and statistics, with applications throughout engineering.
The book focuses on the practical side of mathematics and is an excellent choice for readers interested in how mathematics underpins engineering. It provides a solid grounding in linear algebra and matrix theory, and it can be used by students preparing for graduate study or for careers in engineering and science.
The text is largely self-contained, moving from calculus and differential equations through transform methods and numerical techniques, with worked applications to physics and engineering problems along the way.
The book includes a large number of problems at the end of each chapter that help students develop their understanding of the material covered in the chapter.
5. Computer Age Statistical Inference by Bradley Efron and Trevor Hastie
Computer Age Statistical Inference is a book aimed at data scientists who are looking to learn about the theory behind machine learning and statistical inference. The authors have taken a unique approach in this book, as they have not only introduced many different topics, but they have also included a few examples of how these ideas can be applied in practice.
The book starts with classical statistical inference (frequentist, Bayesian, and Fisherian) and then progresses through computer-age topics such as generalized linear models, the jackknife and the bootstrap, cross-validation, and model and variable selection. It is great for anyone looking for a rigorous bridge between statistics and machine learning.
Computer Age Statistical Inference introduces readers to statistical inference in a modern computational setting. It covers topics such as empirical Bayes, Markov chain Monte Carlo, large-scale hypothesis testing, sparse modeling, random forests, and neural networks, all essential background for data science, and it discusses how to evaluate models and how to choose between parametric and nonparametric methods.
5 Beginner level statistics books for data science
6. How to Lie with Statistics by Darrell Huff
How to Lie with Statistics is one of the most influential books about statistical reasoning. It was first published in 1954 and has been translated into many languages. Rather than teaching you how to deceive, the book shows how statistics are used, and misused, in everyday arguments, and how charts, averages, and samples can be presented to support almost any agenda. It is intended for laymen, relying on illustrations and very little mathematics, and it is full of interesting insights into how people can manipulate data to support their own agendas.
The book is still relevant today because it describes how people use statistics in their daily lives. It gives an understanding of the types of questions that are asked and how they are answered by statistical methods. The book also explains why some results seem more reliable than others.
The first half of the book discusses methods of making statistical claims (including improper ones) and illustrates them with examples from real life. The later chapters turn those lessons around, showing the reader how to talk back to a statistic and spot these tricks in the wild.
A common criticism of the book is that it focuses too much on what statisticians do rather than why they do it. This is true — but that’s part of its appeal!
7. Head-first Statistics: A Brain-Friendly Guide Book by Dawn Griffiths
If you are looking for a book that will help you understand the basics of statistics, then this is the perfect book for you. In this book, you will learn how to use data and make informed decisions based on your findings. You will also learn how to analyze data and draw conclusions from it.
The book also works well for those who have already completed a course in statistics or studied it in college and want a refresher. Griffiths gives an overview of the different types of statistical tests used in everyday life and provides examples of how to use them effectively.
The book starts off with an explanation of statistics, which includes topics such as sampling, probability, population and sample size, normal distribution and variation, confidence intervals, tests of hypotheses and correlation.
After this section, the book goes into more advanced topics such as regression analysis and hypothesis testing, and it touches on data mining techniques like clustering and classification.
The author explains each topic in detail so that readers with little knowledge of statistics can follow along easily. The language used throughout the book is clear and simple, which makes it easy to understand even for beginners.
8. Think Stats By Allen B. Downey
Think Stats is a great book for students who want to learn more about statistics. The author, Allen Downey, uses simple examples and diagrams to explain the concepts behind each topic. The book is especially helpful for those who are new to mathematics or statistics because it is written in an accessible manner that even readers with only a high school background can follow.
The book begins with exploratory data analysis: loading a real dataset, summarizing it, and comparing groups. It then moves on to distributions and to making predictions about what happens when one quantity changes. It also covers topics like randomness, sampling techniques, sampling distributions, estimation, hypothesis testing, and probability.
The author uses real-world examples throughout the book so that readers can see how these concepts apply in their own lives. He also includes exercises at the end of each chapter so that readers can practice what they’ve learned before moving on to the next section of the book. This makes Think Stats an excellent resource for anyone looking for tips on improving their math skills or just wanting to brush up on some statistical basics!
9. An Introduction To Statistical Learning With Applications In R By Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
An Introduction to Statistical Learning with Applications in R is a guide to modern statistical learning. It introduces widely used machine learning techniques and their applications, covering methods for supervised and unsupervised learning, with labs in R at the end of each chapter. The second edition also adds chapters on deep learning, survival analysis, and multiple testing.
It begins with linear regression and classification (including logistic regression and linear discriminant analysis), followed by resampling methods such as cross-validation and the bootstrap. The authors then discuss regularization techniques for regression models, including ridge regression and the lasso, and introduce tree-based methods such as bagging, random forests, and boosting. The remainder of the book deals with support vector machines (SVMs) and unsupervised learning methods such as principal components analysis and clustering.
This statistics book is recommended for researchers who want to learn about statistical machine learning but do not have deep expertise in mathematics or programming.
10. Statistics in Plain English By Timothy C. Urdan
Statistics in Plain English is an accessible guide for students of statistics. Urdan covers basic concepts with examples and guidance for using statistical techniques in the real world. The book includes a glossary of terms, exercises (with solutions), and web resources.
The book begins by explaining the difference between descriptive statistics and inferential statistics, which are used to draw conclusions about data. It then covers basic vocabulary such as mean, median, mode, standard deviation, and range.
In Chapter 2, the author explains how to calculate sample sizes that are large enough to make accurate estimates. In Chapters 3–5 he gives examples of how to use various kinds of data: census data on population density; survey data on attitudes toward various products; weather reports on temperature fluctuations; and sports scores from games played by teams over time periods ranging from minutes to seasons. He also shows how to use these data to estimate the parameters for models that explain behavior in these situations.
The last three chapters explain how to use frequency distributions to answer questions about probability, such as whether there is a significant difference between two possible outcomes or whether there is a trend in a set of numbers over time or space.
Which data science statistics books are you planning to get?
Build upon your statistical concepts and successfully step into the world of data science. Analyze your knowledge and choose the most suitable book for your career to enhance your data science skills. If you have any more suggestions for statistics books for data science, please share them with us in the comments below.
At some point, every aspiring data scientist has to get familiar with mathematics for machine learning.
To be blunt, the more serious you are about learning data science, the more math you'll need to learn for machine learning. If you have a strong math background, this is likely to be less of an issue.
In my case, I've had to relearn much of the mathematics (note: I'm not done yet!) that I took at university, as my professional life had allowed my math skills to atrophy.
Based on my experience teaching our bootcamp, there is also a group of aspiring data scientists whose formal math training needs to be augmented. For example, we have many students who come from marketing backgrounds where studying linear algebra was never a requirement.
What math skills do data scientists need for machine learning?
Forms of the question “what math do I need for data science” and “what math do I need for machine learning” are popular on sites like Quora. I would encourage all aspiring data scientists to perform their own research on this subject and not to take my post as gospel. However, as I often get asked for my opinion on what math aspiring data scientists need to know/study, I will provide my own list:
Basic statistics and probability (e.g., normal and student’s t distributions, confidence intervals, t-tests of significance, p-values, etc.).
Linear algebra (e.g., eigenvectors)
Single variable calculus (e.g., minimization/maximization using derivatives).
Multivariate calculus (e.g., minimization/maximization with gradients).
Please note that the above is not an exhaustive list. To be honest, you can never know too much math as a data scientist. What I would argue is that the above list represents the 80/20 rule: the 20% of math that you will use 80% of the time as a practicing data scientist.
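To make the list concrete, here is a quick NumPy/SciPy illustration touching each area; the numbers are arbitrary and the snippet is only meant to show the kind of math you end up using in practice.

```python
# Quick illustrations of the math areas above; all numbers are arbitrary.
import numpy as np
from scipy import stats

# Statistics: a 95% confidence interval and a one-sample t-test
sample = np.random.default_rng(0).normal(loc=5.2, scale=1.0, size=40)
ci = stats.t.interval(0.95, sample.size - 1,
                      loc=sample.mean(), scale=stats.sem(sample))
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

# Linear algebra: eigenvalues and eigenvectors of a small symmetric matrix
A = np.array([[2.0, 1.0], [1.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Calculus: one gradient step for f(x, y) = x**2 + y**2
point = np.array([1.0, -2.0])
gradient = 2 * point
point = point - 0.1 * gradient

print(ci, p_value, eigenvalues, point)
```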
A list of top math resources
Here’s my list of the top 80/20 math resources for aspiring data scientists:
The Cartoon Guide to Statistics by Larry Gonick and Woollcott Smith
The Cartoon Guide to Statistics is one of the books we provide to our bootcamp students, and it is an excellent resource for gently learning, or refreshing, your statistical knowledge. It covers many of the basic concepts in statistics in an easy-to-consume and entertaining fashion. Well worth a read.
OpenIntro Statistics (3rd edition) by David Diez, Christopher Barr, and Mine Çetinkaya-Rundel
Coursera's Statistics with R Specialization is well worth taking for every aspiring data scientist. The accompanying OpenIntro textbook is also a great read; I liked the book so much I picked up a hard copy from Amazon.
Math for Economists (UCI Open course)
Interestingly, I’ve found that University of California Irvine’s free UCI Open course Math 4: Math for Economists is a most excellent resource for focusing on the specific aspects of linear algebra and multivariate calculus needed for aspiring data scientists.
The accompanying textbook is also quite good and covers several interesting subjects, including single variable calculus for folks that need a refresher.
The takeaway
Studying the above resources will allow you to go a long way in developing the math skills required for data science.
A regular expression is a sequence of characters that specifies a search pattern in a text. Learn more about its common uses in this regex 101 guide.
Regular Expressions Infographic
What is a regular expression?
A regular expression, or regex for short, is something almost every data scientist has to deal with many times in their career, and the frustration is natural: at the vast majority of universities, this skill is not taught unless you take a hard-core computer science theory course. Even then, trying to solve a problem using regular expressions can cause multiple issues, which is summed up beautifully in this meme:
Regular Expressions Meme
Making sense of regular expressions and using them can be quite daunting for beginners, but this RegEx 101 blog has got you covered. Regular expressions, in essence, are used to match strings in the text that we want to find. Consider the following scenario:
You are interested in counting the number of occurrences of pandas in a journal about endangered species to see how much focus is on this species. You write an algorithm that counts the occurrences of the word 'panda.' However, as you might have noticed, your algorithm will miss the words 'Panda' and 'Pandas.' You might argue that a simple if-else condition can also count these words. But imagine that, while the journal was being digitized, the letters of every word were randomly capitalized or converted to lower case. Now there are the following possibilities:
PaNda
pAnDaS
PaNdA
And many more variations as well. Now you would have to write a lot of if-else conditions, possibly even nested ones. What if I told you that you can do this in one line of code using regular expressions? First, we need to learn some basics before coming back to solve the problem ourselves.
PRO TIP: Join our data science bootcamp program today to enhance your data science knowledge!
Square Brackets ([])
The name might sound scary, but it is nothing but the symbol []. Some people also refer to square brackets as a character class, a piece of regular expression jargon meaning that the pattern will match any one of the characters inside the brackets. For instance:
Pattern          Matches
[Pp]enguin       Penguin, penguin
[0123456789]     any single digit from 0 to 9
[0oO]            0, o, O
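For example, in Python (the sample sentence is made up):

```python
# Character classes in action with Python's re module
import re

text = "Penguins and penguins were spotted at -40 degrees."
print(re.findall(r"[Pp]enguin", text))    # ['Penguin', 'penguin']
print(re.findall(r"[0123456789]", text))  # ['4', '0']
```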
Disjunction (|)
The pipe symbol matches either the expression on its left or the one on its right, which is helpful when you want to match any one of several alternatives at once. For instance:
Pattern              Matches
A|B|C                A, B, C
Black|White          Black, White
[Bb]lack|[Ww]hite    Black, black, White, white
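For example, in Python (the sample text is made up):

```python
# Disjunction in action
import re

text = "Black coffee, white noise, black boxes, White House"
print(re.findall(r"[Bb]lack|[Ww]hite", text))  # ['Black', 'white', 'black', 'White']
```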
Question Mark (?)
The question mark symbol means that the character it comes after is optional. For instance:
Pattern      Matches
Ab?c         Ac, Abc
Colou?r      Color, Colour
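For example, in Python (the sample text is made up):

```python
# The optional character in action
import re

text = "Is it colour or color? Abc or Ac?"
print(re.findall(r"[Cc]olou?r", text))  # ['colour', 'color']
print(re.findall(r"Ab?c", text))        # ['Abc', 'Ac']
```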
Asterisk (*)
The asterisk symbol matches with 0 or more occurrences of the earlier character or group. For instance:
Pattern      Matches
Sh*          S, Sh, Shh, Shhh, ... (zero or more of the preceding h)
(banana)*    banana, bananabanana, bananabananabanana, ... (zero or more of the group; this pattern also matches the empty string)
Plus (+)
The plus symbol means to match with one or more occurrences of the earlier character or group. For instance:
Pattern      Matches
Sh+          Sh, Shh, Shhh, ... (one or more of the preceding h)
(banana)+    banana, bananabanana, bananabananabanana, ... (one or more of the group)
Difference between Asterisk (*) and Plus(+)
The difference between the asterisk and the plus confuses many people; even experts sometimes have to look up the distinction. However, there is an effortless way to remember it.
Imagine you have the number 1 and you multiply it by 0:
1 * 0 = 0, so the asterisk means zero or more occurrences of the earlier character or group.
Now suppose you have the same number 1 and you add 0 to it:
1 + 0 = 1, so the plus means one or more occurrences of the earlier character or group.
It is that simple when you try to understand things intuitively.
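The same intuition in Python (the sample text is made up):

```python
# Asterisk vs plus on the same text; note that Sh* also matches a bare S
import re

text = "S Sh Shh Shhh"
print(re.findall(r"Sh*", text))  # ['S', 'Sh', 'Shh', 'Shhh']
print(re.findall(r"Sh+", text))  # ['Sh', 'Shh', 'Shhh']
```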
Negation (^)
The caret symbol has two everyday use cases:
1. Inside square brackets, it will search for the negation of whatever is inside the brackets. For instance:
Pattern          Matches
[^Aa]            any character that is not A or a
[^0123456789]    any character that is not a digit
2. It can also be used as an anchor to search for expressions at the start of the line(s) only. For instance:
Pattern            Matches
^Apple             every Apple at the start of a line in the text
^(Apple|Banana)    every Apple and Banana at the start of a line in the text
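Both uses in Python (the sample strings are made up; re.MULTILINE makes ^ match the start of every line):

```python
# The caret in both roles: negation inside brackets, start-of-line anchor outside
import re

print(re.findall(r"[^Aa]", "Banana"))  # ['B', 'n', 'n']

lines = "Apple pie\nBanana split\nGrape juice"
print(re.findall(r"^(Apple|Banana)", lines, re.MULTILINE))  # ['Apple', 'Banana']
```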
Dollar ($)
The dollar sign is used to search for expressions at the end of a line; unlike the caret, it is placed after the expression it anchors. For instance:

Pattern           Matches
[0123456789]$     any digit at the end of a line in the text
([Pp]anda)$       every Panda and panda at the end of a line in the text
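In Python (the sample lines are made up; re.MULTILINE makes $ match the end of every line):

```python
# End-of-line anchoring; the $ goes after the expression it anchors
import re

lines = "room 101\nI saw a panda\nThe Panda"
print(re.findall(r"[0123456789]$", lines, re.MULTILINE))  # ['1']
print(re.findall(r"([Pp]anda)$", lines, re.MULTILINE))    # ['panda', 'Panda']
```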
Conclusion
This article covered some of the most basic building blocks of regular expressions using a story-telling approach. Regular expressions can look intimidating at first, but once you develop an intuitive understanding of them, they become much easier to learn.
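And, returning to the panda problem from the start of this guide, here is the promised one-liner (the journal text is made up):

```python
# Back to the panda problem: one pattern handles every capitalization variant
import re

journal = "The PaNda was seen near bamboo. Two pAnDaS were counted. A panda slept."
print(re.findall(r"[Pp][Aa][Nn][Dd][Aa][Ss]?", journal))   # ['PaNda', 'pAnDaS', 'panda']
# Or, even more simply, with a case-insensitive flag:
print(re.findall(r"pandas?", journal, re.IGNORECASE))      # ['PaNda', 'pAnDaS', 'panda']
```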