
Dhannush Subramani
| October 19, 2022

In this blog, we will learn about data science career growth in 2022. It is no longer a secret that today’s economy depends heavily on analytics and data-driven decisions.

 

Businesses, enterprises, and governments have spent the last few years collecting and analyzing massive volumes of data. If you are interested in the field of Data Science, enrolling in a course offered by a reputed institution will be an added advantage during your job hunt.

 

data science career growth
7 questions everyone asks about data science career growth

 

Data scientists currently play a crucial part in the success or failure of any organization, and a proper Data Science certification program can help you learn both practically and theoretically. Therefore, it is not a stretch to state that “there is a data scientist behind every hugely successful company.”

 

Overview of Data Science career

Data science is a fascinating, forward-thinking, and lucrative profession. Importantly, unlike other traditional careers, you do not need an established degree or specialized educational background to begin your journey in Data Science.

 

All you need are the proper abilities, some related experience, and a curious mind. Given the need for data scientists, current market trends indicate that data science course fees are growing.

 

In this blog, I’ll go over the ins and outs of the data scientist career path, as well as the abilities necessary for data science. In addition, I’ll guide you on how to choose which data science career is best for you.

 

Alright!! Let’s dive into the topics.

 

Table of Contents:

  • What is Data Science?
  • What does a Data Scientist do?
  • Is Data Science right for you?
  • Why choose a career in Data Science?
  • Job statistics in Data Science career
  • Are you ready to become a Data Scientist?

 

What is Data Science?

 

Data science is the study of massive amounts of data using current tools and methodologies to discover previously unknown patterns, extract valuable information, and make business choices. 

 

Data for analysis can come from a wide range of sources and be provided in a variety of ways.

 

Now that you know what data science is, let’s look at what a Data Scientist will do in 2022.

 

What does a Data Scientist do?

 

Data science is a highly interdisciplinary field that works with a broad variety of data and, unlike other analytical fields, focuses on the overall perspective.

data science career
Data scientist working on data – Data Science Dojo

 

In business, the purpose of data science is to give an insight into customers and campaigns, as well as to aid organizations in building effective plans to engage their audiences and sell their products. 

 

Big data, or enormous amounts of information gathered through different methods such as data mining, necessitates the use of creative thinking on the part of data scientists. So, what exactly does a data scientist do?

 

Data scientists use forecasting models to evaluate data and information to produce key insights that help enterprises expand their businesses in the right direction. One of the key responsibilities is to analyze large data sets of quantitative and qualitative data. 

 

Data scientists are in charge of developing statistical learning models for data analysis and must be proficient with statistical tools. They must also be knowledgeable enough to create complex prediction models.

Is Data Science right for you?

In my opinion, it is crucial to have an answer to this question before embarking on your path in data science. Many blogs on the internet suggest that the field of data science is full of demand, great incomes, and respect.

Nevertheless, the fact is that your journey into data science is not at all easy; it takes continual learning and unlearning of complicated subjects and concepts from different disciplines, and you must stay technically sharp throughout your career.

 

Learn more about Data Science Roadmap 

In this section, I’ll provide you with some suggestions that will take you to the answer to this question. Fundamentally, anyone can acquire and practice any data science skill if they are truly committed to it.

 

Simply said, if you want to learn data science, you can do so.

 

Why choose a career in Data Science?

Data science has been termed the “sexiest job of the twenty-first century.” I’m sure this plays a significant role in your decision to pursue a career in data science. Nowadays, any company, large or small, is looking for employees who can interpret and dissect data.

 

Choosing a profession in data science involves respecting the numerous disciplines on which data science as a subject has been founded, such as statistics, math, and technology, among others. The variety of abilities required to become a data scientist might be considered an advantage.

 

Now, let me direct your attention to a few key reasons why you should pursue a career in data science;

 

  • High prestige
  • Be part of the future
  • Excellent pay
  • Constantly challenging work (no boring work)
  • Exceptional growth & demand in the market
  • Endless career opportunities

 

Data Science has shown the ability to transform companies and our society. It has become a lucrative job due to high demand and a limited supply of trained Data Science workers.

 

Job statistics in Data Science career

If you’re here, I’m presuming you’ve picked or are thinking about choosing a career path. Let me direct your attention to a few more key criteria which might assist you in making your final decision.

  • 650% job growth since 2015 (source: LinkedIn)
  • By 2026, 11.5 million additional jobs are expected to be created (source: U.S. Bureau of Labor Statistics)
  • A data scientist earns an average annual income of $120,931 (source: Glassdoor)
  • In 2020, there were expected to be 2.7 million open positions in data analysis, data science, and related fields (source: IBM)
  • By 2020, employer demand for both data scientists and data engineers was projected to grow by 39% (source: IBM)
  • 59% of these jobs will be in finance, information technology (IT), insurance, and professional services, divided as follows: 19% in banking and insurance, 18% in professional services, and 17% in IT
  • Bachelor’s degree holders will be able to apply for 61% of data scientist and advanced analytics roles, while 39% will require a master’s or Ph.D.
  • Positions in data science and data analysis stay open 5 days longer than the average for all jobs, indicating less competition in these sectors; recruiters must work harder to locate competent candidates
  • A potential annual salary $8,736 higher than any other bachelor’s degree position (source: IBM)

 

Pro-Tip: Build up your Data Science career as a certified Data Scientist

 

The data presented above indicates the development and need for data science specialists across various business areas, geographical regions, and even experience levels. As more businesses implement data-driven solutions, the need for data scientists will continue to rise.

 

So, relax, you’re on the correct track!

 

Are you ready to become a Data Scientist?

Data science is the most in-demand career this decade and will continue to be so in the future. With increased awareness of the industry, competition for positions among professionals is at an all-time high. If you follow this approach and do honest self-evaluation, I am confident you will make the best decision for you.

Enroll in Data Science Bootcamp today to begin your Data Science career

Remember that selecting the proper career path is only the beginning of your journey.

 

Julia Grosvenor
| August 21, 2019

Want to know what kind of data scientist you are (or maybe even what kind you should be)? Take this simple quiz.

The position title is often used pretty vaguely, and that’s because there are so many ways to be one. For every specialty and job responsibility, there are corresponding personality types and skills to match. We created this quiz to tell you what kind of data scientist you are (or maybe even what kind you should be), based on some characters from pop culture!

Quiz for data scientist

If you enjoyed this, some of the questions in this quiz were based on actual interview questions from companies like Amazon and Google. Check out our master list of 101 Data Science Interview Questions for more. Or if you want to expand your data skills, join our Data Science Bootcamp program today.

Do you agree with your result? Comment below!

| July 22, 2022

This blog post will provide you with a comprehensive data science roadmap that can aid your learning, helping you succeed in a world loaded with data.

As of 2020, the average salary a data scientist makes in the US is over $113,000. With that stated, it can be affirmed that data scientists are in high demand. If you think of data science only as a way to earn money, you will never have the actual motivation to learn it. Instead, you should identify a problem, be it a marketing-related or a research problem, and then start learning data science and its tools accordingly, because you cannot excel at every tool or every data science skill set.

First and foremost, you need to motivate yourself to love the data; with no drive, you will probably abandon your learning journey at some point. Furthermore, you need to work on real projects. Just acquiring the fundamental knowledge or skills won’t make you an expert data scientist; to increase your expertise, you need to raise the level of difficulty every time you undertake a data science project. While at work, or by joining a top-rated Data Science Bootcamp, learn from your instructors and peers, and watch how they execute their data science projects. Last but not least, present your insights and analysis to others.

But you might be wondering: what skills exactly do you require to be a successful data scientist, and how do you learn data science? What steps do you need to follow to leap into the field of data science?

Before we get started with the actual data science career path, which of the following expertise/skills do you have?

 

An insight into the Data Science roadmap

Now that you know which skills you already possess, the roadmap below can help you understand where you stand and what effort is needed for you to reach the endpoint.

Read more about Data Science Career 

Data science roadmap
Comprehensive career guide to data science – Data Science Dojo

Step 1: Getting started 

Before you move on to learning & adapting to new skills, it is important for you to understand what data science is & whether you are a great fit for data science or not.

This article by innoarchitech precisely explains what data science is; it further covers the roles of data scientists, data engineers, and data analysts, which can surely help you decide which boat to jump in.

To further assess, check what type of data scientist you are with the below short quiz: 

Step 2: Learn the basics of mathematics & statistics  

The next checkpoint in the data science career path is to learn the fundamentals of mathematics & statistics. The topics listed below should be your area of focus: 

  1. Descriptive Statistics 
  2. Probability  
  3. Inferential Statistics  
  4. Linear Algebra 
  5. Structured Thinking  

This cheat sheet by MIT can help you build your concepts for statistics & likewise here is another cheat sheet by Wzchen that can help you with understanding the basics of probability.  

You can further enrich your concepts with these 5 free statistics books, along with these amazing resources to learn math for data science. If you are wondering why math is needed, then you need to do a quick browse at this blog post by Dave Langer from Data Science Dojo that explains why math is important in data science.  
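To make the first of these topics concrete, here is a tiny sketch of descriptive statistics using only Python’s standard library (the salary numbers are made up for illustration):

```python
import statistics

# Made-up data scientist salaries, in thousands of USD
salaries = [95, 102, 110, 113, 118, 121, 125, 131, 140, 210]

# Central tendency: the mean is pulled up by the 210 outlier,
# while the median stays robust to it
mean = statistics.mean(salaries)      # 126.5
median = statistics.median(salaries)  # 119.5

# Spread: sample standard deviation
stdev = statistics.stdev(salaries)

print(f"mean={mean}, median={median}, stdev={stdev:.1f}")
```

Noticing how a single outlier separates the mean from the median is exactly the kind of intuition descriptive statistics is meant to build.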

Step 3: Acquainting the key tools for data science 

1. Python: It is one of the most popular & widely used programming languages. Learning this language can help you with creating web applications, handling big data, rapid prototyping, and much more. To know more about python, check this introductory blog post for it.  

Learn all the fundamentals of Python for Data Science with our upcoming training! 

2. R: R is another popular programming language. It provides a free software environment for statistical computing. These few blog posts can definitely add value to your knowledge of R programming:

  1. Logistic Regression in R
  2. R language programming for Excel Users
  3. Natural language Processing with R programming books

You might be stuck in the traditional R versus Python argument; if you are wondering which one of them you should opt for, I would suggest you begin with R and transition to Python gradually. Then use them as per the needs of your organization.

3. Data Exploration & Visualization: If you are into the analytical side of data, i.e. data analysis, then you must learn data exploration & visualization. Data exploration is the initial step of data analysis, while data visualization is the graphical representation of the data itself. Both Python & R can be used for exploring & visualizing data.
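As a minimal illustration of data exploration (a standard-library sketch with made-up ages; in practice you would reach for pandas and matplotlib), a quick frequency histogram often reveals the shape of a variable:

```python
from collections import Counter

# Made-up ages of survey respondents
ages = [22, 25, 25, 31, 34, 35, 35, 36, 41, 44, 47, 52, 58, 63]

# Bucket into decades and draw a quick text histogram --
# the same idea a pandas .hist() or a matplotlib bar chart expresses graphically
buckets = Counter((age // 10) * 10 for age in ages)
for decade in sorted(buckets):
    print(f"{decade}s | {'#' * buckets[decade]}")
```

Even this crude view shows where the distribution peaks, which is the point of exploration before any modeling begins.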

Step 4: Learning the key tools for ML 

There exist some basic and advanced machine learning tools that you need to learn & adapt yourself with. Some of the most important ones are listed below. These skills can be of immense value in your overall data science roadmap:  

  1. Exploratory Data Analysis & Data Cleaning: Before moving on to the ML tools, you need to be well versed in what EDA & data cleaning are. EDA, or exploratory data analysis, is a way of studying datasets to summarize them in a visual format. Data cleaning is the process of detecting & correcting errors to ensure that the data is error-free.

     The below cheat sheet & the article here can help you get started with EDA now.

EDA cheatsheet for data science professionals
EDA cheat sheet consisting of non-graphical analysis, univariate analysis and multivariate analysis
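As a small illustration of data cleaning (a plain-Python sketch with made-up records), detecting malformed values and imputing missing ones looks like this:

```python
# Made-up raw records with missing and malformed values
raw = [
    {"name": "Ana", "age": "34"},
    {"name": "Ben", "age": ""},       # missing age
    {"name": "Cy",  "age": "n/a"},    # malformed age
    {"name": "Dee", "age": "29"},
]

def clean_age(value):
    """Return an int age, or None when the value is missing/malformed."""
    try:
        return int(value)
    except ValueError:
        return None

cleaned = [{**row, "age": clean_age(row["age"])} for row in raw]

# One common strategy: impute missing ages with the mean of the known ones
known = [row["age"] for row in cleaned if row["age"] is not None]
mean_age = sum(known) / len(known)
imputed = [{**row, "age": row["age"] if row["age"] is not None else mean_age}
           for row in cleaned]
```

Mean imputation is only one of several strategies (dropping rows or using the median are others); the point is that the detection and correction steps are explicit and repeatable.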

2. Feature Selection & Engineering: This should typically be your next step in learning ML. Feature engineering uses domain knowledge to derive features from the data, which in turn helps improve the performance of ML algorithms. So, if you are willing to gain expertise in the ML domain, you need to learn about feature selection & engineering.

3.  Model Selection: Out of all the statistical models, you will need to select one model that is well-suited for your problem. These are some of the statistical models that you can go with:

A. Linear Regression: It is a supervised machine learning algorithm where the predicted output is continuous and has a constant slope. To get started with linear regression, check out this comprehensive cheat sheet by MIT.
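To see where the constant slope comes from, here is an illustrative from-scratch least-squares fit in plain Python with made-up points (in practice you would use scikit-learn’s LinearRegression or R’s lm):

```python
# Fit y = intercept + slope * x by ordinary least squares (closed form)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = covariance(x, y) / variance(x)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

def predict(x):
    return intercept + slope * x

print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

The single, constant `slope` is exactly what makes the model linear: every unit increase in x changes the prediction by the same amount.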

B. Logistic Regression: It is a supervised classification algorithm used to predict the probability of a target variable. This article can be a great resource for getting started with logistic regression in R.

C. Decision Trees: This generally uses a decision tree to form assumptions & conclusions about the target values. It is one of the most common approaches of predictive modeling used in statistics & machine learning. 

To build your understanding of a decision tree, this comprehensive tutorial can be of great help to you. 

D. K-Nearest Neighbors (KNN): It is one of the simplest supervised machine learning algorithms and can help with resolving regression & classification problems. It is quite easy to comprehend and learn, but it has a few drawbacks.

E. K-Means: This is an unsupervised learning algorithm that groups unlabeled data into distinct clusters, where K represents the number of centroids. This cheat sheet from Stanford University can help you with learning about K-Means.
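A minimal sketch of the idea (one-dimensional, plain Python, with made-up points; in practice you would use scikit-learn’s KMeans): assign each point to its nearest centroid, then move each centroid to the mean of its assigned points, and repeat:

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """A minimal 1-D K-means: assign each point to the nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two clearly separated groups: the centroids settle near their means
print(kmeans_1d([1, 2, 3, 9, 10, 11], k=2))
```

Real data is multi-dimensional and initialization matters (hence tricks like k-means++), but the assign-then-update loop is the whole algorithm.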

F. Naïve Bayes: It is a supervised learning algorithm that helps in solving classification problems. It is considered one of the most successful algorithms because it creates fast ML models that can make quick predictions. Here you can find more about Naïve Bayes.

G. Dimensionality Reduction: A process of transforming a high-dimensional space into a low-dimensional space while maintaining the meaningful properties of the data.

Learning dimensionality reduction is an important skill that every data scientist must possess. Break the curse of dimensionality with Python

H. Random Forests: It is an ensemble learning method for classification, regression, and other tasks. It involves building multiple decision trees and outputting the class that is the mode of all of them. Dive deep with this amazing guide by Berkeley University.

I. Gradient Boosting Machines: One of the leading techniques to build predictive models. It helps to deal with regression & classification problems and creates a prediction model in the form of an ensemble of the weak prediction models. 

This guide can help you get started with Gradient Boosting Machines.  

J. XGBOOST: This tool specifically helps with executing the gradient boosted decision trees devised for speed and performance. 

Find answers to what is XGBOOST, how to build an intuition for it, and much more with the guide here

K. Support Vector Machines: These are supervised learning models with associated learning algorithms that evaluate data for regression & classification analysis.

The below graphic by Avik Jain can be a great help for you to get started with SVMs: 

Support vector machines
Detailed information about support vector machine and tuning parameters

4.  Model Evaluation: Moving to the last step of machine learning, model evaluation estimates how accurately the model will perform on future data. It typically uses two methods: holdout & cross-validation.

Confusion matrix
An image defining the confusion matrix of the classifier
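To make the confusion matrix concrete, here is a plain-Python sketch with made-up predictions from a hypothetical held-out test set:

```python
# Made-up labels from a binary classifier on a held-out test set
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# The four cells of the confusion matrix
tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

# Common metrics derived from the matrix
accuracy  = (tp + tn) / len(actual)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy}, precision={precision}, recall={recall}")
```

Accuracy alone can be misleading on imbalanced data, which is why the matrix and its derived metrics (precision, recall) are worth computing together.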

Step 5: Profile building 

Building a profile on GitHub is an important task that every data scientist must complete. It is one of the most effective ways for a data scientist to gather all the code from the projects they have undertaken. It showcases your projects and shows how long you have been practicing data science.

To get started check this cheat sheet on GitHub

Moving on, you need to be part of some discussion forums. These will help you find answers to the questions you are stuck on. Here are some of the discussion forums you can be part of:

  1. Quora  
  2. Stackoverflow 

To gain more knowledge in the data science domain, start following different YouTube channels.   
Our YouTube channel can surely be a good start for you.  

Step 6: Prepare for a data science interview  

You need to know all the key data science concepts that can help you ace your interviews. With these 101 Data Science Interview Questions, Answers, and Key Concepts, you can prepare yourself for the interviews.

Step 7: Take a look at a typical data scientist’s job 

Reaching the end of your data science roadmap, you might want to get an idea of a typical data scientist’s job. It is always helpful to look at some job descriptions, showcase your skills, and stand out as the best candidate. If you think you are a good fit for it, you must get started right away!

 

Before I end this post, let me repeat it again: instead of trying endlessly to learn all the skills required to be a data scientist, pick a problem that inspires you or is relevant to your domain. Try to solve that problem using data science skills, picking up only the skills necessary to solve it. As you solve more problems, you will learn more skills along the way.

If you hated probability in high school or university, it is probably because every example of probability had to do with coin tosses and dice. But if you happen to come across interesting problems, such as the Birthday Paradox, you might end up loving probability.
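If you want a taste of that, here is a quick plain-Python simulation of the Birthday Paradox:

```python
import random

def shared_birthday_probability(group_size, trials=10_000, seed=42):
    """Estimate, by simulation, the chance that at least two people
    in a group of the given size share a birthday."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        birthdays = [rng.randrange(365) for _ in range(group_size)]
        if len(set(birthdays)) < group_size:  # a duplicate exists
            hits += 1
    return hits / trials

# The counter-intuitive result: only 23 people are needed for roughly 50% odds
print(round(shared_birthday_probability(23), 3))
```

Running the simulation for different group sizes (try 23 versus 60) makes the surprisingly fast growth of the probability tangible in a way dice examples never did.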

Additional support  

Want to learn more about the data science roadmap? The following blog posts have been a great support to me, and likewise, I believe they can be a great help to you as well:

So, what have you decided? Are you planning to get started with Data Science? Take a look at our Data Science Bootcamp, a great way to start your data science journey.

Dave Langer
| January 5, 2017

Process Mining is a critical skill needed by every data scientist and analyst for mining rich and varied data contained in event logs.

Event logs are everywhere and represent a prime source of big data. Event log sources run the gamut from e-commerce web servers to devices participating in globally distributed Internet of Things (IoT) architectures.

Even Enterprise Resource Planning (ERP) systems produce event logs! Given the rich and varied data contained in event logs, process mining these assets is a critical skill needed by every data scientist, business/data analyst, and program/product manager.

At the meetup for this topic, presenter David Langer showed how easy it is to get started process mining your event logs using the OSS tools of R and ProM.

David began the talk by defining which features of a dataset are important for event log mining:

Activity: A well-defined step in some workflow/process.

Timestamp: The date and time at which something worthy of note happened.

Resource: Staff and/or other assets used/consumed in the execution of an activity.

Event: At a minimum, the combination of an activity and a timestamp. Optionally, events may have associated resources, life cycle, and other data.

Case: A related set of events denoted, and connected, by a unique identifier where the events can be ordered.

Event Log: A list of cases and associated events.

Trace: A distinct pattern of case activities within an event log, where each activity is present at most once per trace. Event logs typically contain many traces.

Below is an example of IIS Web Server data that may be used for process mining:

Example IIS Web Server event log for process mining

 

In this example, the traces for this event log are:

  1. portal, dashboard, purchase order report
  2. portal, help, contact us
  3. portal, my team, expense reports
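Those traces can be derived mechanically: group events by case, order each case’s events by timestamp, and collect the distinct activity sequences. Here is a plain-Python sketch with made-up events mirroring the example above:

```python
from collections import defaultdict

# Made-up event log rows: (case_id, timestamp, activity)
events = [
    ("c1", "09:00", "portal"), ("c1", "09:01", "dashboard"),
    ("c1", "09:05", "purchase order report"),
    ("c2", "09:02", "portal"), ("c2", "09:03", "help"),
    ("c2", "09:04", "contact us"),
    ("c3", "09:06", "portal"), ("c3", "09:07", "dashboard"),
    ("c3", "09:09", "purchase order report"),
]

# Group events by case id
cases = defaultdict(list)
for case_id, ts, activity in events:
    cases[case_id].append((ts, activity))

# A trace is a distinct ordered pattern of activities across cases:
# sort each case's events by timestamp, keep the activity sequence,
# and collect the distinct sequences in a set
traces = {tuple(act for _, act in sorted(evts)) for evts in cases.values()}
```

Cases c1 and c3 follow the same path, so this log contains two distinct traces, which is exactly the kind of reduction tools like edeaR and ProM perform at scale.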

David continued his talk with a live demo using the Incident Activity Records dataset from the 2014 Business Process Intelligence Challenge (BPIC).

About the meetup

In this presentation hosted by Data Science Dojo, David covered:

  • The scenarios and benefits of event log mining
  • The minimum data required for event log mining
  • Ingesting and analyzing event log data using R
  • Process mining with ProM
  • Event log mining techniques to create features suitable for machine learning models
  • Where you can learn more about this very handy set of tools and techniques for process mining

Process mining source code

David’s source code can be viewed and cloned at his GitHub repository for this meetup. To clean and process the dataset, he ran through his R script step by step. David installed the R package edeaR, which was used specifically to analyze the dataset.

After cleaning the dataset, he loaded the new .csv file into the process mining workbench tool, ProM, for visualization. The visualization created helped gain insights about the flow of incident activities from open to close.


Speaker: David Langer

Kevin McGowan
| April 1, 2019

The number of applications for data scientist programs has increased. With various online resources, is it necessary to take a university degree in Data Science?

Data Science is one of the fastest-growing fields, and the data shows this trend will continue into the near future. Data Science has become the backbone of many fields – it is the data science that helps us make sense of the information we collect during marketing campaigns, and it is the data science that helps us construct economic models that predict macroeconomic trends. It’s a field bustling with technological innovation, and people studying it will be at the forefront of multiple industries in the years and decades to come.

If you are someone who wants to join the ranks of data scientists, you have multiple ways of achieving your goals, including going to a university, taking online data science courses, and lastly self-learning. Which of these approaches is the best one? Is it still necessary to go to university to have the best prospects of landing a job? This article will answer these questions and help you decide how to approach this exciting new field.

Data, graphs, and analytics

Why might you still need a university degree?

The days when universities were only for diving into academic studies are long gone. The recent advances in technology and the plethora of online resources have made it extremely easy for motivated individuals to learn on their own.

Instead, the university is a place for you to socialize and network with influential people from your field of study. While we like to think we live in a meritocracy where people succeed by skill alone, that has never been true. It is not only about what you know; it is about who you know.

Your university will give you numerous chances to present yourself and your skills to eminent professors and influential people who can help you start a successful career. It is much easier to jump-start your career when you have direct access to employers instead of being one of the hundreds of online resumes they receive each day.

auditorium
An empty auditorium

The difficulty of getting the fundamentals right without an academic setting

Not all academic fields are created equal when it comes to online teaching platforms. There are certain fields of study like computer science and language studies that rely mostly on a passive intake of information, and that makes them excellent subjects to learn online.

Other subjects like philosophy and mathematics require methodological approaches and engaging extensively with professors and classmates, and these present significant hurdles for a self-learner. They’ll have to try harder to learn the concepts and follow the material if they want to learn these subjects, and many online learners aren’t motivated to do so.

While data science is looked at as a subfield of computer science, it requires a good grounding in the fundamentals of Calculus and extensive knowledge of statistics and probability. Due to the field’s heavy reliance on math, an online learner might have trouble handling the subjects.

A good university will provide you with receptive professors and like-minded fellow students that’ll help you engage with the harder subjects and stay motivated.

Innovative approaches making universities obsolete

While self-study textbooks and online video courses have been on the market for decades now, a wave of innovations in teaching methods is starting to threaten our traditional institutions, and the top two approaches, which might prove to be more effective than universities, are interactive learning platforms and gamified learning:

Interactive learning platforms

These were developed in the hopes of making the online learner more proactive. Studies have shown that passively listening to online courses without participation isn’t an effective method of learning.

If you use these platforms, you won’t just learn what a piece of computer code does, but you’ll be asked to use it to solve a problem. You won’t just be told about price equilibrium in Economics, but the platform will tell you to explain a system using the theory. This way you will be able to immediately apply the knowledge you’ve acquired, which makes learning the fields like economics and mathematics much easier.

Gamified learning

One thing the last decade has shown us is how effective games are in capturing people’s attention and gluing them to their seats. That’s why some educators and psychologists have done extensive research to help bring over some aspects of gaming to education.

Correct use of gaming principles in a learning system will make it easier for you to focus on learning more, retain more of the information, and feel less fatigue after long studying sessions. While this method is still in its infancy, it is already showing great promise.

Show, don’t tell: How can you start a career as a data scientist?

While opting out of enrolling in a university might limit your networking, and it is really hard to stand out with an online resume, there are new ways and platforms where you can show your skills!

Competition Sites

Competition sites like Kaggle provide an excellent training ground for budding data scientists to show their skills. They host competitions in diverse fields, from economics to computer vision. The people who come up with the best algorithms not only get monetary rewards but also have a great chance of getting job offers. Most employers will be impressed if you achieve good results in these competitions, as it shows a practical understanding of the field beyond academics.

GitHub and Jupyter Notebooks

GitHub and Jupyter Notebooks allow you to present data analyses in a readable and concise format. Instead of boring old CVs, employers are more receptive to a rich portfolio. Thanks to these tools being completely free and intuitive to use, you’re only limited by your skills when it comes to the projects you tackle. You can build an amazing portfolio from the comfort of your home.

Conclusion

The answer isn’t cut-and-dried. While there have been some movements claiming universities had become completely redundant by 2018, there are still some real benefits to them. Ask yourself if you’d thrive in an academic setting; if yes, then you’d probably see sizable benefits from attending university. On the other hand, the new approaches to learning and portfolio building have made it easier than ever to succeed on your own, and you can do it if you are motivated enough.

You might also like: Is it worth going to university anymore?

Stephanie Donahole
| May 29, 2019

Data Science is a hot topic in the job market these days. What are some of the best places for Data Scientists and Engineers to work in?

To be honest, there has never been a better time than today to learn data science. The job landscape is quite promising, opportunities span multiple industries, and the nature of the job often allows for remote work flexibility and even self-employment. The following post emphasizes the top cities across the globe with the highest pay packages for data scientists.

Industries across the globe keep diversifying on a constant basis. With technology reaching new heights and a majority of the population having unlimited access to an internet connection, there is no denying the fact that big data and data analytics have started gaining momentum over the years. Demand for data analytics professionals currently outweighs the supply, meaning that companies are willing to pay a premium to fill their open job positions. Further below, I would like to mention certain skills required for a job in data analytics.

Python

Python is one of the most used programming languages, and a solid understanding of how it can be applied to data analytics is essential. Even if it’s not a required skill, knowledge and understanding of Python will give you an upper hand when showing future employers the value that you can bring to their companies. Just make sure you learn how to manipulate and analyze data, understand the concept of web scraping and data collection, and start building web applications.

SQL (Structured Query Language)

Like Python, SQL is a relatively easy language to start learning. Even if you are just getting started, a little SQL experience goes a long way. This will give you the confidence to navigate large databases, and to obtain and work with the data you need for your projects. You can always seek out opportunities to continue learning once you get your first job.
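To illustrate how little SQL is needed to get started, here is a small sketch using Python's built-in sqlite3 module; the table and its contents are made up for the example:

```python
import sqlite3

# An in-memory database with a made-up "employees" table, just for practice.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ada", 95000.0), ("Grace", 105000.0), ("Alan", 88000.0)],
)

# A basic aggregate query: average salary across the table.
(avg_salary,) = conn.execute("SELECT AVG(salary) FROM employees").fetchone()
print(avg_salary)  # 96000.0
conn.close()
```

The same `SELECT`/`AVG` pattern carries over directly to large production databases; only the connection changes.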

Data visualization

Regardless of the career path you are looking into, it is crucial to visualize and communicate insights related to your company’s services; this is a valuable skill set that will capture the attention of employers. Data scientists act as data translators, helping other people understand exactly what conclusions to draw from their datasets.
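A minimal visualization sketch using matplotlib, with a made-up metric and values, shows how little code a communicable chart requires:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

# Made-up monthly metric, purely for illustration.
months = ["Jan", "Feb", "Mar", "Apr"]
signups = [120, 135, 160, 190]

fig, ax = plt.subplots()
ax.bar(months, signups)
ax.set_title("New signups per month")
ax.set_ylabel("Signups")
fig.savefig("signups.png")  # a chart you could drop into a report or slide
```

The hard part is not the plotting call but choosing what to show; a single labeled bar chart often communicates more than a table of numbers.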

Best opportunities for a data scientist

Have a look at cities across the globe that offer the best opportunities for the position of a data scientist. The order of the cities does not represent any type of rank.

Average Salary of a Data Scientist in US Dollars
  1. San Jose, California – Have you ever dreamed about working in Silicon Valley? Who hasn’t? It is the dream destination of any tech enthusiast and an emerging hot spot for data scientists across the globe. Home to the headquarters and main offices of many American tech corporations, it offers a plethora of job opportunities and high pay. The average salary of a chief data scientist is estimated at $132,355 per year.
  2. Bengaluru, India – The second city on the list is Bengaluru, India. Its analytics market is touted to be the best in the country, with the state government, analytics startups, and tech giants all contributing substantially to the sector’s development. The average salary is estimated at ₹12 lakh per annum ($17,240.40).
  3. Berlin, Germany – Looking at other European countries, Germany is home to some of the finest automakers and manufacturers. Although the country has not yet been fully explored for data science opportunities, it seems to be expanding its portfolio day in and day out. A data scientist may earn around €11,000, while a chief data scientist will not earn less than €114,155.
  4. Geneva, Switzerland – If you are looking for one of the highest-paying cities in this beautiful country, it is Geneva. Consider yourself fortunate if you land a position there as a data scientist. The mean salary of a researcher starts at 180,000 Swiss francs, and a chief data scientist can earn as much as 200,000 Swiss francs, with an average bonus ranging between 9,650 and 18,000 Swiss francs.
  5. London, United Kingdom – London is one of the top destinations in Europe offering high-paying, reputable jobs. The UK government relies on technology more every day, so the number of opportunities in the field has gone up substantially, with the average salary of a data scientist being £61,543.

I also included the average data scientist salaries from the 20 largest cities around the world in 2019:

  1. Tokyo, Japan: $56,783
  2. New York City, USA: $115,815
  3. Mexico City, Mexico: $32,487
  4. Sao Paolo, Brazil: $45,891
  5. Los Angeles, USA: $120,179
  6. Shanghai, China: $66,014
  7. Mumbai, India: $29,695
  8. Seoul, South Korea: $45,993
  9. Osaka, Japan: $54,417
  10. London, UK: $56,820
  11. Lagos, Nigeria: $48,771
  12. Calcutta, India: $7,423
  13. Buenos Aires, Argentina: $40,512
  14. Paris, France: $37,861
  15. Rio de Janeiro, Brazil: $54,191
  16. Karachi, Pakistan: $6,453
  17. Delhi, India: $20,621
  18. Manila, Philippines: $47,414
  19. Istanbul, Turkey: $30,210
  20. Beijing, China: $72,801
Rahim Rasool
| July 1, 2019

Kaggle Days Dubai is an event to improve your data science skillset. Here’s what you can expect to learn from the grandmasters.

Anyone interested in analytics or machine learning will certainly be aware of Kaggle. Kaggle is the world’s largest community of data scientists and lets companies host prize-money competitions for data scientists around the world to compete in, which has also made it the largest online competition platform. More recently, however, Kaggle has started organizing offline meetups globally.

One such initiative is the organization of Kaggle Days. So far, four Kaggle Days events have been organized in cities around the world, the most recent in Dubai. The format involves a two-day session: presentations, practical workshops, and brainstorming sessions on the first day, followed by an offline competition on the second.

For a machine learning enthusiast with intermediate experience in the field, participating in a Kaggle-hosted competition and teaming up with a Kaggle Grandmaster to compete against other grandmasters was an enjoyable experience in itself. I could not reach the top ranks in the competition, but competing and networking with the dozens of grandmasters and other enthusiasts present during the two-day event boosted my learning and abilities.

My goal was to make the best use of this opportunity, learn as much as I could, and ask the grandmasters the right questions to draw on their wisdom and learn the best ways to approach any data science problem. It was heartwarming to discover how supportive they were as they shared tricks and advice for reaching the top in data science competitions and improving the performance of any machine learning project. In this blog, I’d like to share the insights I gathered during my conversations, along with the noteworthy points I recorded during their presentations.

Strengthen your basic knowledge of Kaggle

My primary mentor during the offline competition was Yauhen Babakhin. Yauhen, a data scientist at H2O.ai, has worked across a range of domains including e-commerce, gaming, and banking, specializing in NLP-related problems. An inspiring personality, he is also one of the youngest Kaggle Grandmasters. Fortunately, I got to network with him the most. His profile dispelled my misconception that only someone with a doctoral degree can achieve the prestige of being a Grandmaster.

During our conversations, the most significant advice from Yauhen was to strengthen our basic knowledge and build an intuition for various machine learning concepts and algorithms. One does not need to go extremely deep into these concepts, or be exceptionally knowledgeable, to begin with. As he said, “start learning a few important learning models but get to know how they work!” It is ideal to start with the basics and extend your knowledge along the way by building experience through competitions, especially the ones hosted on Kaggle. For most queries, Yauhen suggests, one simply needs to know what to search on Google; that alone will get us through most problems despite having limited experience relative to our competitors.

Kaggle competition day 2


Furthermore, Yauhen emphasized how Kaggle single-handedly played a leading role in sharpening his skills. He stressed how challenges pushed him to perform better and learn more. It was such challenges that provoked him to learn beyond his current knowledge and explore areas outside his specialization, such as computer vision, said the winner of the $100,000 TGS Salt Identification challenge. These challenges prompted him to dive into various areas of machine learning, and this is the trick he suggested we use to accelerate career growth.

Through this conversation, I learned the importance of going broad. Though Yauhen advocated selecting competitions that cover a broad range of problems and various aspects of data science, he also suggested limiting that breadth so it aligns with our career pursuits, and considering whether we really need to target something we are never going to use. Lastly, the Grandmaster in his late 20s encouraged us to practice with deep learning models, as they let us target a broad set of problems, and to discover the best approaches used by previous winners and combine them in our projects or competition submissions. These approaches can be found in blogs, kernels, and forum discussions.

Remain persistent

My next detailed interaction was with Abhishek Thakur. The conversation provoked me to ask as many questions as I could, as every suggestion Abhishek gave seemed wise and encouraging. One of the rare holders of two Kaggle Grandmaster titles, in competitions and discussions, Abhishek is the chief data scientist at boost.ai and once attained 3rd rank in global competitions on Kaggle. What made his profile even more convincing was his accelerated growth from novice to Grandmaster within a year and a half. He started his machine learning career from scratch, beginning on Kaggle itself. Despite initially starting with the lowest rank in competitions, Abhishek was adamant that Kaggle was the one platform he could rely on to catapult his growth within such a short period of time.

Abhishek speaking at Kaggle


However, as Abhishek repeatedly said, it all required continuous persistence. Even after being placed in the bottom ranks initially, Abhishek carried on, demonstrating that persistence was the key to his success. When asked about the significant tools that earned him golds in his recent participations, Thakur placed immense emphasis on feature engineering. He insisted that this step, more than any other, distinguishes the winner. Similarly, he suggested that a thorough exploratory data analysis can help one find those magical features that enable winning results.
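Abhishek's point about feature engineering can be illustrated with a tiny, hypothetical example: deriving new columns from raw ones. The dataset and the specific features below are invented for illustration, not taken from any competition:

```python
import pandas as pd

# Invented raw transaction data.
df = pd.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "quantity": [2, 1, 5],
    "timestamp": pd.to_datetime(
        ["2019-06-01 09:00", "2019-06-01 14:30", "2019-06-02 20:15"]
    ),
})

# Derive new features from existing columns: interactions and datetime parts.
df["total"] = df["price"] * df["quantity"]        # interaction feature
df["hour"] = df["timestamp"].dt.hour              # time-of-day signal
df["is_evening"] = (df["hour"] >= 18).astype(int) # simple binary flag

print(df[["total", "hour", "is_evening"]])
```

In a real competition, exploratory data analysis guides which of these derived features actually carry signal; the mechanics, though, are as simple as the lines above.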

Like other Grandmasters who have attained massive success in this domain, Abhishek also emphasized improving one’s personal profile through Kaggle. Not only does it offer a distinct and fast-paced learning experience, as it did for all the grandmasters at the event, but it is also recognized across various industries and by major employers who value these rankings. Abhishek told us how it brought him numerous lucrative job offers over time.

Start instantly with competitions

During the first day, I attended Pavel Pleskov’s workshop on ‘Building The Ultimate Binary Classification Pipeline’. Based in Russia, Pavel currently works for an NLP startup, PointAPI, and was once ranked number 2 among Kagglers globally. The workshop was fantastic, but the conversations during and after it intrigued me the most, as they mostly comprised tips for beginners.

Someone who quit his profitable business to compete on Kaggle, Pavel championed the ‘do what you love’ strategy, arguing that it leads to more life satisfaction and profit. Pavel told us how he started with some of the most popular online machine learning courses but found them lacking in practical skills and homework, a gap he filled using Kaggle. For beginners, he strongly recommended not putting off Kaggle contests until courses are completed, but starting instantly. According to him, practical experience on Kaggle matters more than any course assignment.

Another noteworthy tip from Pavel: unlike many students who approach Kaggle as an academic problem, build fancy architectures, and ultimately do not score well, Pavel approaches each problem with a business mindset. He increases his probability of success by leveraging resources, for example by recruiting teammates who have resources such as a GPU, or by merging his team with another to improve the overall score.

Kaggle competition day 2

When asked about keeping the right balance between building theoretical knowledge and spending that time generating new ideas, Pavel advised looking at forum threads on Kaggle; they can show you how much theoretical knowledge you are missing relative to other competitors. Pavel is an avid user of LightGBM and CatBoost models, which he says have given him superior rankings during competitions. He also suggests the fast.ai library, which, despite receiving many critical reviews, he has found flexible and useful and usually keeps in consideration.

Hunt for ideas and rework them

Due to time constraints during the two-day event, I heard less from another young grandmaster from Russia, who coincidentally shares a first name with his fellow Russian grandmaster: Pavel Ostyakov. Remarkably, Pavel was still an undergraduate student at the time and had been working at Yandex and Samsung AI for the past couple of years.


He brought a distinct set of advice that can prove extremely resourceful when targeting gold in competitions. He emphasized writing clean code that can be reused in the future and allows easy collaboration with teammates, a practice usually overlooked that later becomes troublesome for participants. He also insisted on reading as many Kaggle forums as one can, not just those for the same competition but those from other competitions as well, since most of them are similar. Apart from searching for workable solutions, Pavel suggested also looking for ideas that failed: one should try using (and reworking) those failed ideas, as there is a chance they may work.

Pavel also pointed out that, to surpass other competitors, reading research papers and implementing their solutions can increase your chances of success. Throughout, he stressed maintaining the mindset that anyone can achieve gold in a competition, even with limited experience relative to others.

Experiment with diverse strategies

Other noteworthy tips and ideas I collected while mingling with grandmasters and attending their presentations included those from Gilberto Titericz (Giba), the grandmaster from Brazil with 45 gold medals! When I asked Giba directly, he repeatedly used the keyword ‘experiment’ and insisted that it is always important to experiment with new strategies, methods, and parameters. This is a simple, although tedious, way to learn quickly and get great results.

Training session of Kaggle

Giba also proposed that, to attain top performance, one must build models using different viewpoints of the data. This diversity can come from feature engineering, varying training algorithms, or different transformations, so one must explore all the possibilities. Furthermore, Giba suggested that fitting a model with default hyperparameters is good enough to start a competition and establish a benchmark score to improve on. Regarding teaming up, he repeated that diversity is key here as well: choosing someone who thinks just like you is not a good move.

A great piece of advice from Giba was to blend models. Combining models can improve the performance of the final solution, especially if each model’s predictions have low correlation with the others’. A blend can be something as simple as a weighted average. For instance, non-linear models like gradient boosting machines blend very well with neural-network-based models.

Blending models suggested by Giba
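The weighted-average blend Giba describes can be sketched in a few lines of NumPy. The prediction values and the 0.6/0.4 weights below are invented for illustration; in practice the weights would be tuned on validation data:

```python
import numpy as np

# Hypothetical out-of-fold predictions from two diverse models
# (say, a gradient boosting machine and a neural network).
gbm_preds = np.array([0.20, 0.65, 0.90, 0.40])
nn_preds = np.array([0.30, 0.55, 0.80, 0.50])

# The simplest blend: a weighted average of the two prediction vectors.
weights = (0.6, 0.4)
blend = weights[0] * gbm_preds + weights[1] * nn_preds

print(blend)  # [0.24 0.61 0.86 0.44]
```

The blend helps most when the two models make different mistakes, which is why low correlation between their predictions matters more than either model's individual score.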

Conclusion

Considering the key takeaways from these grandmasters’ suggestions, and observing the way they competed during the offline competition, I noted that beginners in data science should put their effort into trying as many different methodologies as they can. Moreover, the recommendations above stress the significance of taking part in online competitions, no matter how much knowledge or experience one possesses.

I also noted that most of the experienced data scientists were fond of ensemble techniques, and one of their most prominent methods was creating new features out of existing ones. In fact, this is what the winners of the offline competition cited as their strategy for success. In conclusion, these sorts of meetups let one interact with the top minds in the field and gain a great deal within a short period of time, as I fortunately did.
