
This blog looks into common data science myths. The field of data is ever-growing, and you’ll often come across buzzwords surrounding it. Being a trendy field, it also attracts statements that can be confusing or entirely mythical. Let us bust these myths and clear up your doubts!

What is Data Science?

In simple words, data science involves using models and algorithms to extract knowledge from data available in various forms. The data may be large or small, structured (such as a table) or unstructured (such as a document containing text and images with spatial information). The role of the data scientist is to analyze this data and extract information that can be used to make data-driven decisions.

The Flawed Data Science Compass

Myths

Now, let us dive into some of the myths:

1. Data Science is all about building machine learning and deep learning models

Although building models is a key aspect, it does not define the entirety of the role of a Data Scientist. A lot of work goes on before you proceed with building these models. There is a common saying in this field that is “Garbage in, garbage out.” Real-life data is rarely available in a clean and processed form, and a lot of effort goes into pre-processing this data to make it useful for building models. Up to 70% of the time can be consumed in this process.

This entire pipeline can be split into multiple stages: acquiring, cleaning, and pre-processing data; visualizing, analyzing, and understanding it; and only then building useful models with it. If you build machine learning models using readily available libraries, the code for your model might end up being fewer than 10 lines! So the modelling code itself is often not the most complex part of your pipeline.
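To make the "fewer than 10 lines" claim concrete, here is a minimal sketch using scikit-learn (assumed installed) on a toy dataset that is already clean, which is precisely why the modelling step looks so short:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data that arrives already clean -- the rare, lucky case.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The "model building" part: just two lines.
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

On real data, the acquisition and cleaning that precede these few lines are where most of the effort goes.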

2. Only people with a programming or mathematical background can become Data Scientists

Another myth surrounding data science is that only people from certain backgrounds can pursue a career in it, which is not the case at all! Data science is a handy tool that can help a business enhance its performance in almost every field.

For example, human resources is a field that might seem distant from statistics and programming, but it offers a very good data science use case. By collecting employee data, IBM has built an internal AI system that uses machine learning to predict when an employee might quit. A person with domain knowledge of human resources is often the best fit for building such a model.

Regardless of your background, you can learn it online with our top-rated courses from scratch. Join one of our top-rated programs including Data Science Bootcamp and Python for Data Science and get started!

Join our Data Science Bootcamp today to start your career in the world of data. 

3. Data Analysts, Data Engineers, and Data Scientists all perform the same tasks

Data analyst and data scientist roles have overlapping responsibilities. Data analysts carry out descriptive analytics, collecting current data and making informed decisions from it. For example, a data analyst might notice a drop in sales and try to uncover the underlying cause using the collected company data. Data scientists also make these informed business decisions; however, their work involves using statistics and machine learning to predict the future!

Data scientists use the same collected data but build predictive models that forecast future outcomes and guide the company on the right actions to take before something happens. Data engineers, on the other hand, build and maintain data infrastructure and data systems. They’re responsible for setting up data warehouses and building the databases where the collected data is stored.

4. Large data results in more accurate models

This myth is partly right and partly wrong. More data does not necessarily translate to higher model accuracy. More often, the performance of your model depends on how well you clean your dataset and extract its features. After a certain point, the performance of your model will start to converge regardless of how much you increase the size of your dataset.

As the saying “garbage in, garbage out” goes, if the data you feed the model is noisy and not properly processed, its accuracy will likely be poor as well. Therefore, to enhance the accuracy of your models, you must ensure that the quality of the data you provide is up to the mark. Only a greater quantity of relevant, high-quality data will positively impact your model’s accuracy!
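To make the “garbage in, garbage out” point concrete, here is a small self-contained sketch (the data is synthetic and invented for illustration): the same simple nearest-centroid classifier is scored once on clean features and once on the same features drowned in noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated classes in one dimension.
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(4, 1, 500)])
y = np.array([0] * 500 + [1] * 500)

def nearest_centroid_accuracy(features):
    """Fit one centroid per class, then score against the true labels."""
    c0 = features[y == 0].mean()
    c1 = features[y == 1].mean()
    pred = (np.abs(features - c1) < np.abs(features - c0)).astype(int)
    return (pred == y).mean()

clean_acc = nearest_centroid_accuracy(x)
noisy_acc = nearest_centroid_accuracy(x + rng.normal(0, 3, x.size))

print(f"clean data: {clean_acc:.2f}, noisy data: {noisy_acc:.2f}")
```

Same model, same amount of data: only the data quality changed, and the accuracy drops with it.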

5. Data collection is the easiest part of data science

When learning how to build machine learning models, you would often go to open data sources and download a CSV or Excel file with a click of a button. However, data is not that readily available in the real world and you might need to go to extreme lengths to acquire it.

Once acquired, the data will often be unformatted and unstructured, and you will have to pre-process it to make it structured and meaningful. Sourcing, collecting, and pre-processing data can be difficult, challenging, and time-consuming. However, it is an essential part of the pipeline, because you cannot build a model without any data!
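As a tiny illustration of that pre-processing step, here is a hedged sketch with pandas (the column names and values are invented for the example): a small “raw” extract with a missing reading is loaded, typed, and imputed before it could feed any model.

```python
import io
import pandas as pd

# A made-up raw extract: note the missing heart-rate reading.
raw = io.StringIO(
    "patient_id,visit_date,heart_rate\n"
    "p1,2023-01-05,72\n"
    "p1,2023-02-10,\n"
    "p2,2023-03-05,81\n"
)

df = pd.read_csv(raw, parse_dates=["visit_date"])   # parse dates on load
df["heart_rate"] = df["heart_rate"].fillna(df["heart_rate"].median())

print(df)
```

Even this trivial file needed typing and imputation; real-world data demands far more of the same.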

Data comes from numerous sources and is usually collected over a period of time using automated or manual means. For example, to build a health profile of a patient, data about their visits will be recorded, telemetry from their health devices such as sensors can be collected, and so on. That is just one patient; a hospital might deal with thousands of patients every day. Think about all that data!

Please share with us some of the myths that you might have encountered in your data science journey.

Want to upgrade your data science skillset? Check out our Python for Data Science training.

Working within the Data Science industry has made me religiously follow a few data science blogs that I use to stay up to date with industry trends, learn new concepts, and understand the vernacular.

As a newcomer, these three things were originally hard for me to grasp, until I started reading everything I could. These are the data science blogs I follow, and you should too.

Data science blogs I follow:

R-Bloggers

 

R-bloggers logo

R-bloggers began when its creator, Tal Galili, got fed up with trying to find blogs about R. Instead of continuing his search, Tal created a site that pulls feeds from contributing blogs: R-bloggers “is a blog aggregator of content contributed by bloggers who write about R”. If your blog is all about R, you can create an RSS feed and contribute to the “R blogosphere”. This aggregator is a great place to find different blogs, especially if you’re new to the industry (like me).

Towards Data Science

Towards Data Science logo

Whether you enjoy data science as a hobby or a profession, you should be reading Towards Data Science (TDS). In October 2016 TDS joined Medium with the goal of “gathering good posts and distributing them to a broader audience”. Now, Towards Data Science includes 1,500 authors from around the world. TDS offers contributors an editorial team to help raise the quality of posts being submitted. While reading an article on TDS, you know you’re getting high-quality content you can trust.

KDnuggets

KDnuggets logo

KDnuggets is another staple of data science blogs. The site has received so many impressive awards, I couldn’t decide which ones to list. You’ll have to settle with viewing them yourself.

It may seem messy when you first visit but, much like early Reddit users, that’s the way I like it, and the 500,000 monthly visitors would probably agree. Posts range from courses and tutorials to news, meetings, and opinions. Like TDS, KDnuggets offers high-quality content you can trust to help you learn.

Entrepreneur

Entrepreneur logo

Entrepreneur is different from the three blogs above. Instead of covering anything and everything within data science, it keeps its content specifically about how data science and big data affect entrepreneurship and small business. This blog is great for entrepreneurs and small business owners who want to apply these concepts to their businesses. The market for using data science to make data-driven business decisions is growing and should not be overlooked.

DataFloq

DataFloq logo

One of my favorite things about DataFloq is how easy it is to navigate the site. It has a list of tags at the top of the Articles page that makes sorting through the posts very easy. It’s also easy to find events going on around the world.

The blog itself mostly focuses on big data, artificial intelligence, and new technologies. I usually find myself cruising through the AI or IoT tags; there’s always a new article to read about one of those topics. You can also see how many views an article has received without having to click on it, which I use to gauge the quality of the content within the post – typically, the more views, the higher the quality. If you’re looking for anything to do with new, emerging technologies, I suggest browsing DataFloq.

Dataconomy

 

Dataconomy logo

I use Dataconomy almost strictly for learning about trends within blockchain. It isn’t updated as frequently as DataFloq or the other blogs above, but it still gives helpful insights into what is trending within the data science industry.

Dataconomy prides itself on having a global network of contributors who don’t just look at the major tech companies. Authors are encouraged to find new and promising tech startups that will take the world by storm.

Who do you follow?

Is there a data science blog you think I have to read? Let me know! Follow the discussion link below to start a conversation. I’m always looking for new blogs to read to continue my data science education and learn new industry trends.

US-AI vs China-AI – What does the race for AI mean for data science worldwide? Why is it getting a lot of attention these days?

Although it may still be recovering from the effects of the government shutdown, data science has received a lot of positive attention from the United States Government. Two major recent milestones include the OPEN Government Data Act, which passed in January as part of the Foundations for Evidence-Based Policymaking Act, and the American AI Initiative, which was signed as an executive order on February 11th.

The future of data science and AI

The first thing to consider is why, specifically, the US administration has passed these recent measures. Although it’s not mentioned in either of the documents, any political correspondent who has been following these topics could easily explain that they are intended to stake a claim against China.

China has stated its intention to become the world leader in data science and AI by 2030. And with far greater government access to data sets (a benefit of China being a surveillance state) and an estimated $15 billion invested in machine learning, it seems to be well on its way. In contrast, the US has only $1.1 billion budgeted annually for machine learning.

So rather than compete with the Chinese government directly, the US appears to have taken the approach of convincing the rest of the world to follow their lead, and not China’s. They especially want to direct this message to the top data science companies and researchers in the world (especially Google) to keep their interest in American projects.

So, what do these measures do?

On the surface, both the OPEN Government Data Act and the American AI Initiative strongly encourage government agencies to amp up their data science efforts. The former is somewhat self-explanatory in name, as it requires agencies to publish more machine-readable publicly available data and requires more use of this data in improved decision making. It imposes a few minimal standards for this and also establishes the position of Chief Data Officers at federal agencies. The latter is somewhat similar in that it orders government agencies to re-evaluate and designate more of their existing time and budgets towards AI use and development, also for better decision making.

Critics are quick to point out that the American AI Initiative does not allocate more resources for its intended purpose, nor does either measure directly impose incentives or penalties. This is not much of a surprise given the general trend of cuts to science funding under the Trump administration. Thus, the likelihood that government agencies will follow through with what these laws ‘require’ has been met with skepticism.

However, this is where it becomes important to remember the overall strategy of the current US administration. Both documents include copious amounts of values and standards that the US wants to uphold when it comes to data, machine learning, and artificial intelligence. These may be the key points of contrast with China, whose government receives a hefty share of international criticism for its use of surveillance and censorship. (Again, this has been a major sticking point for companies like Google.)

These are some of the major priorities brought forth in both measures: make federal resources, especially data and algorithms, available to all data scientists and researchers; prepare the workforce for technology changes like AI and optimization; work internationally towards AI goals while maintaining American values; and, finally, create regulatory standards to protect security and civil liberties in the use of data science.

So there you have it. Both countries are undeniably powerhouses for data science. China may have the numbers in its favor, but the US would like the world to know that they have an American spirit.

Not working for both? –  US-AI vs China-AI

In short, the phrase “a rising tide lifts all ships” seems to fit here. While the US and China compete for data science dominance at the government level, everyone else can stand atop this growing body of innovations and make their own.

The thing data scientists can get excited about in the short term is the release of a lot of new data from US federal sources or the re-release of such data in machine-readable formats. The emphasis is on the public part – meaning that anyone, not just US federal employees or even citizens, can use this data. To briefly explain for those less experienced in the realm of machine learning and AI, having as much data to work with as possible helps scientists to train and test programs for more accurate predictions.

If the government shutdown was a dark period for data scientists, these measures suggest a golden age may follow shortly.

R and Python remain the most popular data science programming languages. But if we compare R vs Python, which of these languages is better?

As data science becomes more and more applicable across every industry sector, you might wonder which programming language is best for implementing your models and analysis. If you attend a data science Bootcamp, Meetup, or conference, chances are you’ll run into people who use one of these languages.

Since R and Python remain the most popular languages for data science, according to IEEE Spectrum’s latest rankings, it seems reasonable to debate which one is better. Although it’s best to use the language you are most comfortable with and one that suits the needs of your organization, for this article we will evaluate the two languages across four key categories: Data Visualization, Modelling Libraries, Ease of Learning, and Community Support.

Data visualization

A significant part of data science is communication. Most of the time, you as a data scientist need to show your result to colleagues with little or no background in mathematics or statistics. So being able to illustrate your results in an impactful and intelligible manner is very important. Any language or software package for data science should have good data visualization tools.

Good data visualization involves clarity. No matter how complicated your model is, there should be a simple and unambiguous way of illustrating your results, such that even a layperson would understand.

Python

Python is renowned for its extensive number of libraries, and plenty of them can be used for plotting and visualization. The most popular are matplotlib and seaborn. matplotlib is adapted from MATLAB, with similar features and styles. It is a very powerful visualization tool with all kinds of functionality built in: it can be used to make simple plots very easily, especially as it works well with the other core Python data science libraries, pandas and numpy.

Although matplotlib can make a whole host of graphs and plots, what it lacks is simplicity. The most troublesome aspect is adjusting the size of the plot: if you have a lot of variables it can get hectic trying to neatly fit them all into one plot. Another big problem is creating subplots; again, adjusting them all in one figure can get complicated.

seaborn builds on top of matplotlib, adding more aesthetic graphs and plots. The library is surely an improvement on matplotlib’s archaic style, but it still has the same fundamental problem: creating complex figures can be complicated. However, recent developments have tried to make things simpler.
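As a minimal sketch of the subplot workflow discussed above (assuming matplotlib is installed), here is the object-oriented pattern for laying out two plots in one figure, with tight_layout() doing the size adjustment that the text calls fiddly:

```python
import matplotlib
matplotlib.use("Agg")          # non-interactive backend for scripts
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# One figure, two side-by-side axes.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, np.sin(x))
ax1.set_title("sin(x)")
ax2.plot(x, np.cos(x))
ax2.set_title("cos(x)")

fig.tight_layout()             # adjust sizes so labels don't collide
fig.savefig("trig.png")
```

Two plots are manageable; with many variables and many panels, this boilerplate is exactly where the complexity creeps in.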

R

Many libraries can be used for data visualization in R, but ggplot2 is the clear winner in terms of usage and popularity. The library follows a grammar of graphics philosophy, with layers used to draw objects on plots. Layers are often interconnected and can share many common features, which allows you to create very sophisticated plots with very few lines of code. The library also allows the plotting of summary functions. ggplot2 is thus more elegant than matplotlib, and I feel that in this department R has an edge.

It is, however, worth noting that Python includes a ggplot library modeled on the original ggplot2 in R. For this reason, R and Python are roughly on par with each other in this department.

Modelling libraries

Data science requires the use of many algorithms, and these sophisticated mathematical methods require robust computation. It is rarely, if ever, the case that you as a data scientist need to code a whole algorithm on your own – doing so is incredibly inefficient and sometimes very hard. That is why data scientists need languages with built-in modelling support, and it is one of the biggest reasons Python and R get so much traction in the data science space: the models you can easily build with them.

Python

As mentioned earlier, Python has a very large number of libraries, so it comes as no surprise that it has an ample number of machine learning libraries: scikit-learn, XGBoost, TensorFlow, Keras, and PyTorch, to name a few. Python also has pandas, which supports tabular data and makes it very easy to manipulate CSV or Excel-based data.

In addition, Python has great scientific packages like numpy. Using numpy, you can perform complicated mathematical calculations, such as matrix operations, in an instant. All of these packages combined make Python a powerhouse suited for hardcore modelling.
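For instance, the matrix operations mentioned above are one-liners in numpy:

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([[5.0, 6.0],
              [7.0, 8.0]])

product = a @ b                 # matrix multiplication
inverse = np.linalg.inv(a)      # matrix inverse
identity = a @ inverse          # should recover the identity matrix

print(product)
```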

R

R was developed by statisticians and scientists to perform statistical analysis long before it was such a hot topic. As one would expect from a language made by scientists, one can build a plethora of models using R. Just like Python, R has plenty of libraries – approximately 10,000 of them. mice, rpart, party, and caret are among the most widely used. These packages will have your back from the pre-modelling phase through the post-model/optimization phase.

Since you can use these libraries to solve almost any sort of problem, for this discussion let’s just look at what you can’t model. Python is lacking in statistical non-linear regression (beyond simple curve fitting) and mixed-effects models. Some would argue that these are not major barriers, or can simply be circumvented. True! But when the competition is stiff, you have to be nitpicky to decide which is better. R, on the other hand, lacks the speed that Python provides, which matters when you have large amounts of data (big data).

Ease of learning

It’s no secret that currently data science is one of the most in-demand jobs, if not the one most in demand. As a consequence, many people are looking to get on the data science bandwagon, and many of them have little or no programming experience. Learning a new language can be challenging, especially if it is your first. For this reason, it is appropriate to include ease of learning as a metric when comparing the two languages.

Python

Python was designed in 1989 with a philosophy that emphasizes code readability and a vision of making programming simple, and its designers succeeded: the language is fairly easy to learn. Although Python takes inspiration for its syntax from C, unlike C it is uncomplicated. I recommend it as my language of choice for beginners, since anyone can pick it up in relatively little time.

R

I wouldn’t say that R is a difficult language to learn. Quite the contrary: it is simpler than many languages like C++ or JavaScript. Like Python, much of R’s syntax is based on C, but unlike Python, R was not envisioned as a language that anyone could learn and use, as it was initially designed specifically for statisticians and scientists. IDEs such as RStudio have made R significantly more accessible, but in comparison with Python, R is a relatively more difficult language to learn.

In this category, Python is the clear winner. However, it must be noted that programming languages in general are not hard to learn. If a beginner wanted to learn R, it wouldn’t, in my opinion, be as easy as learning Python, but it wouldn’t be an impossible task either.

Community support

Every so often as a data scientist you are required to solve problems that you haven’t encountered before. Sometimes you may have difficulty finding the relevant library or package that could help you solve your problem. To find a solution, it is not uncommon for people to search in the language’s official documentation or online community forums. Having good community support can help programmers, in general, to work more efficiently.

Both of these languages have active Stack Overflow communities and active mailing lists where one can easily ask experts for solutions. R has online documentation where you can find information about functions and their inputs, and most Python libraries, like pandas and scikit-learn, have their own official online documentation as well.

Both languages have significant user bases and hence very active support communities, so it isn’t difficult to see that the two are equal in this regard.

Why R?

R has been used for statistical computing for over two decades now, and you can get started with writing useful code in no time. It has been used extensively by data scientists and has an insane number of packages available for data science-related tasks. I have almost always been able to find an R package to get a task done quickly. I have decent Python skills and have written production code in Python; even so, I find R slightly better for quickly testing out ideas, trying different ways to visualize data, and rapid prototyping work.

Why Python?

Python has many advantages over R in certain situations. It is a general-purpose programming language with libraries like pandas, NumPy, SciPy, and scikit-learn, to name a few, which come in handy for data science-related work.

If you get to the point where you have to showcase your data science work, Python would be a clear winner. Python combined with Django is an awesome web application framework, which can help you create a web service or site with both your data science and web programming done in the same language.

You may hear some speed and efficiency arguments from both camps – ignore them for now. If you get to a point when you are doing something substantial enough where the speed of your code matters to you, you will probably figure out things on your own. So don’t worry about it at this point.

You can learn Python for data science with Data Science Dojo!

R and Python – The most popular languages

Considering that you are a beginner in both data science and programming and that you have a background in Economics and Statistics, I would lean towards R. Besides being very powerful, Python is without a doubt one of the friendliest programming languages to beginners – but it is still a programming language. Your learning curve may be a bit steeper in Python as opposed to R.

You should learn Python once you are comfortable with R and have grasped the general concepts of data science – which will take some time. You can read “What are the key skills of a data scientist?” to get an idea of the skill set you will need to become a data scientist.

Start with R, transition to Python gradually and then start using both as needed. Both are great for data science but one is better than the other in certain situations.

The number of applications for data science programs has increased. With so many online resources available, is it still necessary to take a university degree in data science?

Data science is one of the fastest-growing fields, and the data shows this trend will continue into the near future. Data science has become the backbone of many fields – it is data science that helps us make sense of the information we collect during marketing campaigns, and it is data science that helps us construct economic models to predict macroeconomic trends. It’s a field bustling with technological innovation, and the people studying it will be at the forefront of multiple industries in the years and decades to come.

If you are someone who wants to join the ranks of data scientists, you have multiple ways of achieving your goals, including going to a university, taking online data science courses, and lastly self-learning. Which of these approaches is the best one? Is it still necessary to go to university to have the best prospects of landing a job? This article will answer these questions and help you decide how to approach this exciting new field.

Data, graphs, and analytics

Why might you still need a university degree?

The days when universities were the only place to dive into academic studies are long gone. Recent advances in technology and the plethora of online resources have made it extremely easy for motivated individuals to learn on their own.

Instead, the university is a place for you to socialize and network with influential people from your field of study. While we like to think we live in a meritocracy where people succeed by skill alone, that has never been true. It is not only about what you know; it is about who you know.

Your university will give you numerous chances to present yourself and your skills to eminent professors and influential people who can help you start a successful career. It is much easier to jump-start your career when you have direct access to employers, instead of being one of the hundreds of online resumes they receive each day.

An empty auditorium

The difficulty of getting the fundamentals right without an academic setting

Not all academic fields are created equal when it comes to online teaching platforms. There are certain fields of study like computer science and language studies that rely mostly on a passive intake of information, and that makes them excellent subjects to learn online.

Other subjects, like philosophy and mathematics, require methodological approaches and extensive engagement with professors and classmates, which presents significant hurdles for a self-learner. Self-learners have to try harder to grasp the concepts and follow the material, and many online learners aren’t motivated enough to do so.

While data science is looked at as a subfield of computer science, it requires a good grounding in the fundamentals of Calculus and extensive knowledge of statistics and probability. Due to the field’s heavy reliance on math, an online learner might have trouble handling the subjects.

A good university will provide you with receptive professors and like-minded fellow students that’ll help you engage with the harder subjects and stay motivated.

Innovative approaches making universities obsolete

While self-study textbooks and online video courses have been on the market for decades now, a wave of innovations in teaching methods is starting to threaten our traditional institutions. The top two approaches, which might prove more effective than universities, are interactive learning platforms and gamified learning:

Interactive learning platforms

These were developed in the hopes of making the online learner more proactive. Studies have shown that passively listening to online courses without participation isn’t an effective method of learning.

If you use these platforms, you won’t just learn what a piece of computer code does – you’ll be asked to use it to solve a problem. You won’t just be told about price equilibrium in economics – the platform will ask you to explain a system using the theory. This way you can immediately apply the knowledge you’ve acquired, which makes learning fields like economics and mathematics much easier.

Gamified learning

One thing the last decade has shown us is how effective games are in capturing people’s attention and gluing them to their seats. That’s why some educators and psychologists have done extensive research to help bring over some aspects of gaming to education.

Correct use of gaming principles in a learning system will make it easier for you to focus on learning more, retain more of the information, and feel less fatigue after long studying sessions. While this method is still in its infancy, it is already showing great promise.

Show, don’t tell: how you can start a career as a data scientist

While choosing not to enrol in a university might limit your networking, and it is really hard to stand out with an online resume, there are new ways and platforms where you can show your skills!

Competition Sites

Competition sites like Kaggle provide an excellent training ground for budding data scientists to show their skills. They host competitions from diverse fields, from economics to computer vision. The people who come up with the best algorithms not only earn monetary rewards but also have a great chance of getting job offers. Most employers will be impressed by good results in these competitions, as they show a practical understanding of the field beyond academics.

Github and Jupyter Notebooks

GitHub and Jupyter notebooks allow you to present data analyses in a readable and concise format. Instead of boring old CVs, employers are more receptive to a rich portfolio. With these tools being completely free and intuitive to use, you’re limited only by your skills when it comes to the projects you tackle. You can build an amazing portfolio from the comfort of your home.

Conclusion

The answer isn’t cut-and-dried. While there have been movements claiming universities became completely redundant in 2018, they still offer some real benefits. Ask yourself whether you’d thrive in an academic setting; if so, you’d probably see sizable benefits from attending university. On the other hand, the new approaches to learning and portfolio building have made it easier than ever to succeed on your own – if you are motivated enough.

You might also like: Is it worth going to university anymore?

Working with high-dimensional data is a necessary skill for every data scientist. Break the curse of dimensionality with Python.

Most of us are so used to seeing and interpreting things in two or three dimensions that the thought of thinking visually about higher dimensions seems intimidating. However, you can still grasp the core intuition and ideas behind the enormous world of dimensions.

Let’s focus on the essence of dimensionality reduction. Many data science tasks demand that we work with high-dimensional data. When building machine learning models, we often add extra features to capture salient characteristics, even when many of them provide little new information. With these redundant features, the model’s performance eventually deteriorates. This phenomenon is often referred to as “the curse of dimensionality”.

The curse of high dimensionality

When we keep adding features without increasing the amount of data used to train the model, the feature space becomes sparse, since the average distance between points grows in high-dimensional space. Due to this sparsity, it becomes much easier for the model to find a solution that fits the training data perfectly but is far from optimal. Consequently, the model doesn’t generalize well, making its predictions unreliable. You may also know this as “overfitting”. It is necessary to reduce the number of features considerably to boost the model’s performance and to arrive at an optimal solution.
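To make the sparsity claim concrete, here is a small illustrative sketch (the function name and sample sizes are my own) showing how pairwise distances concentrate as dimensionality grows, so that "near" and "far" points become hard to tell apart:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(n_points: int, n_dims: int) -> float:
    """Ratio of (max - min) to mean pairwise distance for random points.

    As dimensionality grows this ratio shrinks: all points become
    almost equally far apart, one face of the curse of dimensionality.
    """
    X = rng.random((n_points, n_dims))
    # Pairwise Euclidean distances (upper triangle only, no self-pairs).
    diffs = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diffs ** 2).sum(-1))
    d = d[np.triu_indices(n_points, k=1)]
    return (d.max() - d.min()) / d.mean()

low = distance_spread(200, 2)      # 2-D: distances vary widely
high = distance_spread(200, 1000)  # 1000-D: distances concentrate
print(low, high)
```

Running this shows the spread collapsing in high dimensions, which is exactly why distance-based reasoning (and nearest-neighbor search) degrades there.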

What if I tell you that fortunately for us, in most real-life problems, it is often possible to reduce the dimensions in our data, without losing too much of the information the data is trying to communicate? Sounds perfect, right?

Now let’s suppose that we want to get an idea about how some high-dimensional data is arranged. The question is, would there be a way to know the underlying structure of the data? A simple approach would be to find a transformation. This means each of the high-dimensional objects is going to be represented by a point on a plot or space, where similar objects are going to be represented by nearby points and dissimilar objects are going to be represented by distant points.

Figure 1: Transformation from High Dimensional Space to Low Dimensional Space

 

Figure 2: Transformation from High Dimensional Space to Low Dimensional Space

Figure 1 illustrates a bunch of high-dimensional data points we are trying to embed in a two-dimensional space. The goal is to find a transformation such that the distances in the low dimensional map reflect the similarities (or dissimilarities) in the original high dimensional data. Can this be done? Let’s find out.

As we can see from Figure 2, after a transformation is applied, this criterion isn’t fulfilled: the distances between points in the low-dimensional map do not match the distances between the corresponding points in the high-dimensional space. The goal is to ensure that these distances are preserved. Keep this idea in mind, since we are going to come back to it later!

Now we will break down the broad idea of dimensionality reduction into two branches – Matrix Factorization and Neighbor Graphs. Both cover a broad range of techniques. Let’s look at the core of Matrix Factorization.

Figure 3: Matrix Factorization

Matrix factorization

The goal of matrix factorization is to express a matrix as approximately the product of two smaller matrices. Matrix A represents the data, where each row is a sample and each column a feature. We want to factor it into a low-dimensional representation U multiplied by a set of exemplars V, which are used to reconstruct the original data. A single sample of your data can be represented as one row U_i of the representation, which acts upon the entire matrix of exemplars V. Therefore, each row of matrix A can be represented as a linear combination of these exemplars, and the coefficients of that linear combination are your low-dimensional representation. This is matrix factorization in a nutshell.

You might have noticed the approximation sign in Figure 3, and while reading the above explanation you might have wondered why we are using an approximation here. If a bigger matrix can be broken down into two smaller matrices, shouldn’t the two sides be equal? Let’s figure that out!

In reality, we cannot decompose matrix A such that it’s exactly equal to U·V. What we would like to do is break down the matrix such that U·V is as close as possible to A when reconstructed from the representation U and exemplars V. When we talk about approximation, we are introducing the notion of optimization: we are going to minimize some form of error between our original data A and the reconstructed data A′, subject to some constraints. Variations of these losses and constraints give us different matrix factorization techniques. One such variation brings us to the idea of Principal Component Analysis.
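The reconstruction idea above can be sketched in a few lines of NumPy. A truncated SVD happens to give the best rank-k factorization A ≈ U·V in the least-squares sense, so it serves as a minimal illustration (the matrix sizes and rank are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.random((100, 20))   # 100 samples, 20 features

# Truncated SVD yields the best rank-k approximation A ~ U @ V
# in the Frobenius-norm sense (Eckart-Young theorem).
k = 5
U_full, s, Vt = np.linalg.svd(A, full_matrices=False)
U = U_full[:, :k] * s[:k]   # low-dimensional representation (100 x 5)
V = Vt[:k, :]               # exemplars (5 x 20)

A_hat = U @ V               # reconstruction from U and V
error = np.linalg.norm(A - A_hat)
print(U.shape, V.shape, error)
```

Each row of U holds the coefficients that combine the rows of V to approximate the corresponding row of A, which is precisely the "linear combination of exemplars" described above.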

Neighbor graphs: The theme revolves around building a graph from the data and embedding that graph in a low-dimensional space. Two questions need answering at this point: how do we build the graph, and how do we embed it? Assume there is some data with a non-linear low-dimensional structure. A linear approach like PCA will not be able to find it. However, if we draw a graph of the data in which each point is linked by an edge to its nearby, or k-nearest, neighbors, there is a fair chance that the graph will uncover the structure we wanted to find. This technique introduces us to the idea of t-distributed stochastic neighbor embedding.

Now let’s begin the journey to find how these transformations work using the algorithms introduced.

Principal component analysis

 

Figure 4: Scatter Plot

 

Let’s follow the logic behind PCA. Figure 4 shows a scatter plot of points in 2D, in two different colors. Think about the distribution of the red points. These points have some mean value x̄ and a set of characteristic vectors v1 and v2. What if I told you that the red points can be represented using only their v1 coordinates plus the mean? Essentially, we can think of the red points as lying on a line, and all you are given is their position along that line. This amounts to keeping the coordinate along the line while ignoring the perpendicular offsets, since we don’t care by how much the points are off the line. We have just reduced the dimensions from 2 to 1.

Going by this example, it doesn’t look like a big deal. But in higher dimensions, this can be a great achievement. Imagine you’ve got something in a 20,000-dimensional space. If you could find a representation in roughly 200 dimensions, that would be a huge reduction.

Now let me explain PCA from a different perspective. Suppose, you wish to differentiate between different food items based on their ingredients. According to you, which variable will be a good choice to differentiate food items? If you choose an ingredient that varies a lot from one food item to another and is not common among different foods, then you will be able to draw a difference quite easily. The task will be much more complex if the chosen variable is consistent across all the food items. Coming back to reality, we don’t usually find such variables which segregate the data perfectly into various classes. Instead, we have the ability to create an entirely new variable through a linear combination of original variables such that the following equation holds:

Y = 2X1 − 3X2 + 5X3

Recall the concept of a linear combination from matrix factorization. This is essentially what PCA does: it finds the linear combination of the original variables along which the variance, or spread, is maximal. This new variable Y is known as a principal component. PC1 captures the maximum amount of variance in the data, PC2 accounts for the largest remaining variance, and so on. We talked about minimizing the reconstruction error in matrix factorization; classical PCA employs the mean squared error as the loss between the original and reconstructed data.

Figure 5: MNIST Visualization using PCA

 

Let’s generate a three-dimensional plot of the PCA-reduced data using the MNIST dataset with the help of Hypertools, a Python toolbox for gaining geometric insights into high-dimensional data. The dataset consists of 70,000 digit images across 10 classes. The scatter plot in Figure 5 shows a different color for each digit class. From the plot, we can see that the low-dimensional space holds some significant information, since a number of clusters are visible. One thing to note is that the colors are assigned to the digits based on the labels provided with the data.
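If you don’t have Hypertools at hand, scikit-learn’s PCA produces the same kind of reduction. As a minimal sketch, here is the projection to three components on scikit-learn’s built-in 8×8 digits dataset (a smaller stand-in for MNIST, an assumption of this example, not what the article’s figure used):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 1,797 images of handwritten digits, 64 pixel features each.
X, y = load_digits(return_X_y=True)

# Project onto the first three principal components.
pca = PCA(n_components=3)
X_low = pca.fit_transform(X)

# Each successive component captures less of the remaining variance.
ratios = pca.explained_variance_ratio_
print(X_low.shape, ratios)
```

`X_low` can be fed directly to any 3-D scatter plot; coloring the points by `y` reproduces a picture like Figure 5.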

But what if this information wasn’t available? What if we converted the plot to grayscale? You wouldn’t be able to make much out of it, right? Maybe on the bottom right you would see a little high-density structure, but it would basically be one blob of data. So the question is: can we do better? Is PCA minimizing the right objective? PCA is mainly concerned with preserving large pairwise distances in the map. Maximizing variance is roughly the same as minimizing a squared error between distances in the original data and distances in the map, and a squared error is dominated by the very large distances.

Is this what an informative visualization of the low-dimensional space should look like? Definitely not. If you think of the data as lying on a non-linear manifold (Figure 6), you will see that the Euclidean distance between two points does not reflect their true similarity accurately. The straight-line distance between the two points suggests they are similar, whereas along the manifold they are very far apart. The key idea is that PCA does not work well for visualization because it preserves large pairwise distances, which are not reliable. This brings us to our next key concept.

Figure 6: Non-Linear Manifold

t-distributed Stochastic Neighbor Embedding (t-SNE)

In a high-dimensional space, the intent is to measure only the local similarities between points, which basically means measuring similarities between the nearby points.

Figure 7: Map from High to Low Dimensional Space

 

Looking at Figure 7, let’s focus on the yellow data point in the high-dimensional space. We are going to center a Gaussian over this point x_i and measure the density of all the other points under this Gaussian. This gives us a set of probabilities p_ij, which measure the similarity between pairs of points i and j. This probability distribution is over pairs of points, where the probability of picking any given pair is proportional to their similarity. If two points are close together in the original high-dimensional space, p_ij will be large; if two points are dissimilar, p_ij will be infinitesimal.

Let’s now devise a way to use these local similarities to achieve our initial goal of reducing dimensions. Consider the two- or three-dimensional space that will be our final map after transformation. The yellow point can be represented there by y_i. The same approach is applied: we center another kernel over y_i and measure the density of all the other points under this distribution. This gives us probabilities q_ij, which measure the similarity of pairs of points in the low-dimensional space. The goal is for the q_ij to reflect the similarities p_ij as faithfully as possible. To retain and uncover the underlying structure in the data, the q_ij should be identical to the p_ij, so that the structure of the transformed data closely matches the structure of the data in the original high-dimensional space.
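As an illustrative sketch (not t-SNE’s actual implementation, which tunes a per-point Gaussian bandwidth to a target perplexity and symmetrizes the result), the pair probabilities p_ij can be computed with a fixed-bandwidth Gaussian kernel like this:

```python
import numpy as np

def gaussian_similarities(X: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Pairwise similarities p_ij under a Gaussian kernel, normalized
    into a probability distribution over pairs of points.

    Simplified sketch of t-SNE's p_ij: the real algorithm chooses a
    different sigma for each point via a perplexity target.
    """
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    P = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)   # a point is not its own neighbor
    return P / P.sum()         # normalize so all p_ij sum to 1

# Two nearby points and one distant outlier (toy data).
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
P = gaussian_similarities(X)
print(P)
```

The resulting matrix concentrates nearly all of the probability mass on the pair of nearby points, which is exactly the "local similarity" behavior described above.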

To compare these two distributions, we use a measure known as the Kullback–Leibler (KL) divergence. Minimizing it ensures that if two points are close in the original space, the algorithm will place them in each other’s vicinity in the low-dimensional space. Conversely, if two points are far apart in the original space, the algorithm is relatively free in where it places them. We want to lay out the points in the low-dimensional space so as to minimize the divergence between these two probability distributions, ensuring that the transformed space is a model of the original space. Gradient descent is applied to the KL divergence, iteratively moving the points around until the divergence is as small as possible.
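A minimal sketch of the quantity being minimized (the helper name and toy distributions here are illustrative, not part of any t-SNE library):

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """KL(P || Q) between two discrete probability distributions.

    The asymmetry is what matters for t-SNE: a large p_ij paired with
    a small q_ij (close points mapped far apart) is penalized heavily,
    while the reverse costs comparatively little.
    """
    mask = p > 0                      # terms with p_ij = 0 contribute 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.7, 0.2, 0.1])         # "true" pair similarities
q_close = np.array([0.6, 0.25, 0.15]) # map that roughly preserves them
q_far = np.array([0.1, 0.2, 0.7])     # map that inverts them
print(kl_divergence(p, q_close), kl_divergence(p, q_far))
```

The divergence is near zero when the map’s similarities match the original ones and grows sharply when similar points are placed far apart, which is the gradient signal t-SNE’s optimizer follows.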

The key takeaway is that similarity in the high-dimensional space is based on a Gaussian distribution, whereas in the embedded space we use a heavy-tailed Student-t distribution. This is where the t in t-SNE comes from.

This whole explanation boils down to one question: why this distribution? Suppose we want to project data from a 10-dimensional hypercube down to two dimensions. Realistically, we can never preserve all the pairwise distances accurately; we need to compromise and find a middle ground. t-SNE tries to map similar points close to each other and dissimilar ones far apart.

Figure 8 illustrates this concept with three points in two-dimensional space. The yellow lines are the small distances that represent the local structure. The distances between the corners of the triangle constitute the global structure, denoted as large pairwise distances. We want to transform this data into one dimension while preserving the local structure. Once the transformation is done, you can see that the distance between the two points that were far apart has grown. The heavy-tailed distribution allows this to happen: if two points have a pairwise distance of, say, 20, and a Gaussian gives them a density of 2, then to get the same density under the Student-t distribution, because of its heavy tails, the points have to be 30 or 40 apart. So for dissimilar points, the heavy-tailed q_ij allows them to be modeled farther apart in the map than they were in the original space.

 

Figure 8: Three points in two-dimensional space

 

Let’s try running the algorithm on the same dataset. I have used Dash, a productive Python framework for building data visualization apps that render in the web browser, to build a t-SNE visualization. A great feature is that you can hover over a data point to check which class it belongs to. Isn’t that amazing? All you need to do is input your data in CSV format and it does the magic for you.

So, what you see here is t-SNE running gradient descent, learning the embedding while minimizing the KL divergence. There is much more structure than in the PCA plot: the 10 different digits are well separated in the low-dimensional map. You can vary the parameters and monitor how the learning changes; you may also want to read the great Distill article on how to interpret t-SNE results. Remember, the digit labels were not used to generate this embedding; they are only used to color the plot. t-SNE is a completely unsupervised algorithm. This is a huge improvement over PCA, because even without labels and colors we would still be able to separate the clusters.
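The article’s interactive demo uses Dash, but the embedding itself takes only a few lines with scikit-learn’s TSNE estimator. As a minimal sketch (run here on a 500-image subset of scikit-learn’s smaller 8×8 digits dataset, an assumption of this example, to keep it fast):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]        # subset so the example runs quickly

# Gradient descent on the KL divergence happens inside fit_transform.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X)
print(X_2d.shape)
```

Scatter-plotting `X_2d` colored by `y` gives a map like Figure 9; the `perplexity` parameter controls the effective neighborhood size and is worth experimenting with.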

Figure 9: MNIST visualization using t-SNE

Autoencoders

Figure 10: Architecture of an auto-encoder

The word ‘auto-encoder’ has been floating around for a while now. It may sound like a cryptic name at first, but it’s not: an auto-encoder is simply a computational model that learns a representation, or encoding, for some sort of data. Fortunately for us, this encoding can be used for dimensionality reduction.

To explain the working of auto-encoders in the simplest terms, let’s look back at what PCA was trying to do. All PCA does is that it finds a new coordinate system for your data such that it aligns with the orientation of maximum variability in the data. This is a linear transformation since we are simply rotating the axis to end up in a new coordinate system. One variant of an auto-encoder can be used to generalize this linear dimension reduction.

How does it do that?

As you can see in Figure 10, there are a number of hidden units through which the input data passes. The input and output of an auto-encoder are identical, since the idea is to learn an unsupervised compression of the data. The latent space contains a compressed representation of the image, which is the only information the decoder is allowed to use to reconstruct the input. If these hidden units and the output layer are linear, the auto-encoder learns a linear function of the data and minimizes the squared reconstruction error.

That’s what PCA was doing, right? In terms of principal components, the first N hidden units will span the same space as the first N components found by PCA, which account for the maximum variance. But in terms of learning, auto-encoders go beyond this: they can learn complex non-linear manifolds and uncover the underlying structure of the data.
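Full auto-encoders are usually built with a deep learning framework, but the linear case discussed above can be sketched without one. As a rough illustration (the dataset, bottleneck size, and the use of scikit-learn’s MLPRegressor are assumptions of this example, not the standard way to build auto-encoders), a network trained to reproduce its own input through a narrow identity-activation bottleneck behaves like a simple linear auto-encoder:

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler

X, _ = load_digits(return_X_y=True)
X = MinMaxScaler().fit_transform(X)   # scale pixels to [0, 1]

# Input = output: the network must squeeze 64 pixel features through
# an 8-unit linear bottleneck and reconstruct them, minimizing the
# squared reconstruction error -- the linear auto-encoder setup.
ae = MLPRegressor(hidden_layer_sizes=(8,), activation="identity",
                  max_iter=500, random_state=0)
ae.fit(X, X)

# Encode by applying the first layer's weights: these 8 numbers per
# image are the learned low-dimensional code (the latent space).
codes = X @ ae.coefs_[0] + ae.intercepts_[0]
print(codes.shape)
```

Swapping `activation` to a non-linearity such as `"relu"` is what lets the model move beyond the PCA-like linear subspace toward non-linear manifolds.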

In the PCA projection (Figure 11), you can see that the clusters for the digits 5, 3, and 8 are merged, solely because they are all made up of similar pixels. This visualization captures only 19.8% of the variance in the original dataset.

Coming to t-SNE, we can now hope for a much more interesting projection of the latent space. The projection shows denser clusters, which means that in the latent space the same digits lie close to one another. The digits 5, 3, and 8 are now much easier to separate, appearing in their own small clusters.

These embeddings of the raw MNIST input images let us visualize what the encoder has managed to capture in its compressed layer representation. Brilliant!

Figure 11: PCA visualization of the Latent Space

 

Explore

Phew, so many techniques! And there are plenty of others we haven’t discussed here. It is very difficult to identify an algorithm that always works better than all the rest; the superiority of one over another depends on the context and the problem at hand. What these algorithms have in common is that each preserves some properties of the data and sacrifices others in order to achieve a specific goal. You should dive in and explore them for yourself! There is always a chance you’ll discover a perspective that works well for you.

In the famous words of Mick Jagger, “You can’t always get what you want, but if you try sometimes, you just might find you get what you need.”

Learn more about high dimensional data

  • Towards-an-interpretable-latent-space