

Master hyperparameter tuning for machine learning models
Ayesha Saleem
| March 28, 2023

Machine learning algorithms require the use of various parameters that govern the learning process. These parameters are called hyperparameters, and their optimal values are often unknown a priori. Hyperparameter tuning is the process of selecting the best values of these parameters to improve the performance of a model. In this article, we will explore the basics of hyperparameter tuning and the popular strategies used to accomplish it.  

Understanding hyperparameters 

In machine learning, a model has two types of parameters: hyperparameters and learned parameters. The learned parameters are updated during the training process, while the hyperparameters are set before the training begins.

Hyperparameters control the model’s behavior, and their values are usually set based on domain knowledge or heuristics. Examples of hyperparameters include learning rate, regularization coefficient, batch size, and the number of hidden layers.
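
To make the distinction concrete, here is a minimal sketch in plain Python. The toy data, the starting values, and the specific learning rate and epoch count are assumptions chosen for illustration; the point is only that `learning_rate` and `n_epochs` are fixed before training, while `w` and `b` are updated by it.

```python
# Hyperparameters: chosen before training starts (illustrative values).
learning_rate = 0.05
n_epochs = 2000

# Toy data generated from y = 2x + 1 (an assumption for this sketch).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

# Learned parameters: start at arbitrary values and are updated by training.
w, b = 0.0, 0.0

for _ in range(n_epochs):
    n = len(xs)
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # Gradient descent step, scaled by the learning-rate hyperparameter.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned w={w:.3f}, b={b:.3f}")
```

Changing `learning_rate` or `n_epochs` changes how (and whether) training converges, but neither value is ever modified by the training loop itself.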


Why is hyperparameter tuning important? 

The values of hyperparameters significantly affect the performance of a model. Suboptimal values can result in poor performance or overfitting, while optimal values can lead to better generalization and improved accuracy. In summary, hyperparameter tuning is crucial to maximizing the performance of a model. 

Hyperparameter tuning for ML models

Strategies for hyperparameter tuning 

There are different strategies used for hyperparameter tuning, and some of the most popular ones are grid search and randomized search. 

Grid search: This strategy evaluates a range of hyperparameter values by exhaustively searching through all possible combinations of parameter values in a grid. The best combination is selected based on the model’s performance metrics.  

Randomized Search: This strategy evaluates a random set of hyperparameter values within a given range. This approach can be faster than grid search and can still produce good results. 

General hyperparameter tuning strategy 

To effectively tune hyperparameters, it is crucial to follow a general strategy. A general hyperparameter tuning strategy consists of three phases: 

  • Preprocessing and feature engineering 
  • Initial modeling and hyperparameter selection 
  • Refining hyperparameters 


Preprocessing and feature engineering
 

The first phase involves preprocessing and feature engineering. This includes data cleaning, data normalization, and feature selection. In this phase, hyperparameters that affect the preprocessing and feature engineering steps are set, such as the number of features to be selected. 

Initial modeling and hyperparameter selection 

The second phase involves initializing the model and selecting a range of hyperparameter values to test. This includes setting the model type and other model-specific hyperparameters, such as the learning rate or the number of hidden layers.  

Refining hyperparameters 

In the final phase, the hyperparameters are fine-tuned by adjusting their values based on the model’s performance metrics. This can be done using scikit-learn’s GridSearchCV or RandomizedSearchCV, or other strategies. 

Most common questions asked about hyperparameters 

Q: Can hyperparameters be learned during training? 

A: No, hyperparameters are set before the training begins and are not updated during the training process.   

Q: Why is it necessary to set the hyperparameters? 

A: Hyperparameters control the learning process of a model, and their values can significantly affect its performance. Setting the hyperparameters helps to improve the model’s accuracy and prevent overfitting. 

Methods for hyperparameter tuning in machine learning

Hyperparameter tuning is an essential step in machine learning to fine-tune models and improve their performance. Several methods are used to tune hyperparameters, including grid search, random search, and bayesian optimization. Here’s a brief overview of each method:  


1. Grid search:

Grid search is a commonly used method for hyperparameter tuning. In this method, a set of candidate values is defined for each hyperparameter, and every combination of those values is tried to find the best one.

Grid search is suitable for small and quick searches of hyperparameter values that are known to perform well generally. However, it may not be an efficient method when the search space is large. 
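
The exhaustive loop behind grid search can be sketched in a few lines of plain Python. The `validation_score` function below is a hypothetical stand-in for "train a model and score it on validation data", and the grid values are illustrative assumptions; in practice a library routine such as scikit-learn’s GridSearchCV would handle training and cross-validation.

```python
from itertools import product


def validation_score(learning_rate, n_layers):
    """Hypothetical stand-in for training a model and scoring it.

    By construction it peaks at learning_rate=0.1 and n_layers=2.
    """
    return -((learning_rate - 0.1) ** 2) * 100 - (n_layers - 2) ** 2


# Candidate values for each hyperparameter (illustrative assumptions).
param_grid = {
    "learning_rate": [0.001, 0.01, 0.1, 1.0],
    "n_layers": [1, 2, 3],
}


def grid_search(score_fn, grid):
    """Evaluate every combination in the grid and keep the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    n_trials = 0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        n_trials += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_trials


best_params, best_score, n_trials = grid_search(validation_score, param_grid)
print(best_params, n_trials)
```

The number of trials is the product of the list lengths (here 4 × 3 = 12), which is why the cost grows quickly as more hyperparameters or values are added.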

2. Random search:

Unlike grid search, in a random search, only a part of the parameter values are tried out. In this method, the parameter values are sampled from a given list or specified distribution, and the number of parameter settings that are sampled is given by n_iter.

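Random search is appropriate for exploring large search spaces and for discovering new hyperparameter values or new combinations of hyperparameters. Because its cost is fixed by n_iter rather than by the size of a grid, it is often cheaper than grid search, although it is not guaranteed to find the best combination. 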
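
A minimal sketch of random search in plain Python follows. The `validation_score` function is again a hypothetical stand-in for training and scoring a model, and the sampling ranges are assumptions; the `n_iter` argument plays the same role as the parameter of that name mentioned above.

```python
import random


def validation_score(learning_rate, n_layers):
    """Hypothetical stand-in for training a model and scoring it."""
    return -((learning_rate - 0.1) ** 2) * 100 - (n_layers - 2) ** 2


def random_search(score_fn, n_iter, seed=0):
    """Sample n_iter random settings and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {
            # Sampled from a distribution: log-uniform between 1e-3 and 1.
            "learning_rate": 10 ** rng.uniform(-3, 0),
            # Sampled from a given list of values.
            "n_layers": rng.choice([1, 2, 3]),
        }
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search(validation_score, n_iter=20)
print(best_params, best_score)
```

The budget is exactly `n_iter` evaluations regardless of how many hyperparameters are searched, which is the main practical difference from the grid search loop.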

3. Bayesian optimization:

Bayesian optimization is a method for hyperparameter tuning that aims to find the best set of hyperparameters by building a probabilistic model of the objective function and then searching for the optimal values. This method is suitable when the search space is large and complex.

Bayesian optimization is based on the principle of Bayes’s theorem, which allows the algorithm to update its belief about the objective function as it evaluates more hyperparameters. This method can converge quickly and may result in better performance than grid search and random search.
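
The idea can be illustrated with a deliberately small sketch in plain Python. Everything specific here is an assumption made for illustration: a one-dimensional hyperparameter in [0, 1], a hypothetical `objective`, a Gaussian process with an RBF kernel as the probabilistic model, and an upper-confidence-bound rule for choosing the next point. Production libraries use more careful numerics and other acquisition functions.

```python
import math


def objective(x):
    """Hypothetical validation score for one hyperparameter x in [0, 1]."""
    return -(x - 0.3) ** 2


def rbf(a, b, length_scale=0.2):
    """RBF kernel: similarity between two hyperparameter values."""
    return math.exp(-((a - b) ** 2) / (2 * length_scale ** 2))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def posterior(x_new, xs, ys, noise=1e-6):
    """Gaussian-process posterior mean and variance at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k = [rbf(x_new, a) for a in xs]
    mean = sum(ki * wi for ki, wi in zip(k, solve(K, ys)))
    var = rbf(x_new, x_new) - sum(ki * vi for ki, vi in zip(k, solve(K, k)))
    return mean, max(var, 0.0)


candidates = [i / 50 for i in range(51)]
xs = [0.0, 1.0]                      # two initial evaluations
ys = [objective(x) for x in xs]

for _ in range(8):
    # Upper confidence bound: prefer points with high mean or high uncertainty.
    def ucb(x):
        mean, var = posterior(x, xs, ys)
        return mean + 2.0 * math.sqrt(var)

    x_next = max((c for c in candidates if c not in xs), key=ucb)
    xs.append(x_next)                # evaluate and update the belief
    ys.append(objective(x_next))

best_x = xs[ys.index(max(ys))]
print(f"best x found: {best_x:.2f}")
```

Each iteration refits the model to all evaluations so far, which is the "updating its belief" step described above; the acquisition rule then balances trying promising values against exploring uncertain ones.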

Choosing the right method for hyperparameter tuning

In conclusion, hyperparameter tuning is essential in machine learning, and several methods can be used to fine-tune models. Grid search is a simple and efficient method for small search spaces, while random search can be used for discovering new hyperparameter values.

Bayesian optimization is a powerful method for complex and large search spaces that can result in better performance by building a probabilistic model of the objective function. Choosing the right method based on the problem at hand is essential. 

Discovering MLOps – The key to efficient machine learning deployment
Ruhma Khawaja
| March 24, 2023

Ready to revolutionize the way you deploy machine learning? Look no further than MLOps – the future of ML deployment. Let’s take a step back and dive into the basics of this game-changing concept.

Machine Learning (ML) has become an increasingly valuable tool for businesses and organizations to gain insights and make data-driven decisions. However, deploying and maintaining ML models can be a complex and time-consuming process. 

What is MLOps?

MLOps, also known as ML Operations, is a set of practices and tools for streamlining the deployment, maintenance, and management of ML models in a production environment. The goal of MLOps is to ensure that models are reliable, secure, and scalable, while also making it easier for data scientists and engineers to develop, test, and deploy ML models. 

Key components of MLOps 

  • Automated Model Building and Deployment: Automated model building and deployment are essential for ensuring that models are accurate and up to date. This can be achieved with tools like continuous integration and deployment (CI/CD) pipelines, which automate the process of building, testing, and deploying models. 
  • Monitoring and Maintenance: ML models need to be monitored and maintained to ensure they continue to perform well and provide accurate results. This includes monitoring performance metrics, such as accuracy and recall, tracking and fixing bugs, and other issues. 
  • Data Management: Effective data management is crucial for ML models to work well. This includes ensuring that data is properly labeled and processed, managing data quality, and ensuring that the right data is used for training and testing models. 
  • Collaboration and Communication: Collaboration and communication between data scientists, engineers, and other stakeholders is essential for successful MLOps. This includes sharing code, documentation, and other information and providing regular updates on the status and performance of models. 
  • Security and Compliance: ML models must be secure and comply with regulations, such as data privacy laws. This includes implementing secure data storage and processing, and ensuring that models do not infringe on privacy rights or compromise sensitive information. 

Advantages of MLOps 

The advantages of MLOps (Machine Learning Operations) are numerous and provide significant benefits to organizations that adopt this practice. Here are some of the key advantages: 

Advantages of MLOps – Data Science Dojo

1. Streamlined deployment: MLOps streamlines the deployment of ML models, making it faster and easier for data scientists and engineers to get their models into production. This helps to speed up the time to market for ML projects, which can have a major impact on an organization’s bottom line. 

2. Better accuracy of ML models: MLOps helps to ensure that ML models are reliable and accurate, which is critical for making data-driven decisions. This is achieved through regular monitoring and maintenance of the models and automated tools for building and deploying models. 

3. Collaboration boost between data scientists and engineers: MLOps promotes collaboration and communication between data scientists and engineers, which helps to ensure that models are developed and deployed effectively. This also makes it easier for teams to share code, documentation, and other information, which can lead to more efficient and effective development processes. 

4. Improves data management and compliance with regulations: MLOps helps to improve data management and ensure compliance with regulations, such as data privacy laws. This includes implementing secure data storage and processing, and ensuring that models do not infringe on privacy rights or compromise sensitive information. 

5. Reduces the risk of errors: MLOps reduces the risk of errors and downtime in ML projects, which can have a major impact on an organization’s reputation and bottom line. This is achieved using automated tools for model building and deployment and through regular monitoring and maintenance of models. 

Best practices for implementing MLOps 

Best practices for implementing MLOps (Machine Learning Operations) can help organizations to effectively manage the development, deployment, and maintenance of ML models. Here are some of the key best practices: 

  • Start with a solid data management strategy: A solid data management strategy is the foundation of MLOps. This includes developing data governance policies, implementing secure data storage and processing, and ensuring that data is accessible and usable by the teams that need it. 
  • Use automated tools for model building and deployment: Automated tools are critical for streamlining the development and deployment of ML models. This includes tools for model training, testing, and deployment, and for model version control and continuous integration. 
  • Monitor performance metrics regularly: Regular monitoring of performance metrics is an essential part of MLOps. This includes monitoring model accuracy and stability, tracking resource usage, and watching other key performance indicators. 
  • Ensure data privacy and security: MLOps must prioritize data privacy and security, which includes ensuring that data is stored and processed securely and that models do not compromise sensitive information or infringe on privacy rights. This also includes complying with data privacy regulations and standards, such as GDPR (General Data Protection Regulation). 

By following these best practices, organizations can effectively implement MLOps and take full advantage of the benefits of ML. 

Wrapping up 

MLOps is a critical component of ML projects, as it helps organizations to effectively manage the development, deployment, and maintenance of ML models. By implementing MLOps best practices, organizations can streamline their ML development and deployment processes, ensure that ML models are reliable and accurate, and reduce the risk of errors and downtime in ML projects. 

In conclusion, the importance of MLOps in ML projects cannot be overstated. By prioritizing MLOps, organizations can ensure that they are making the most of the opportunities that ML provides and that they are able to leverage ML to drive growth and competitiveness successfully.

Handling imbalanced data: 7 innovative techniques for successful analysis
Ayesha Saleem
| March 21, 2023

Imbalanced data is a common problem in machine learning, where one class has a significantly higher number of observations than the other. This can lead to biased models and poor performance on the minority class. In this blog, we will discuss techniques for handling imbalanced data and improving model performance.   

Understanding imbalanced data 

Imbalanced data refers to datasets where the distribution of class labels is not equal, with one class having a significantly higher number of observations than the other. This can be a problem for machine learning algorithms, as they can be biased towards the majority class and perform poorly on the minority class. 

Techniques for handling imbalanced data

Dealing with imbalanced data is a common problem in data science, where the target class has an uneven distribution of observations. In classification problems, this can lead to models that are biased toward the majority class, resulting in poor performance on the minority class. To handle imbalanced data, various techniques can be employed. 

How to handle imbalanced data – Data Science Dojo

 1. Resampling techniques

Resampling techniques involve modifying the original dataset to balance the class distribution. This can be done by either oversampling the minority class or undersampling the majority class. 

Oversampling techniques include random oversampling, synthetic minority over-sampling technique (SMOTE), and adaptive synthetic sampling (ADASYN). Undersampling techniques include random undersampling, NearMiss, and Tomek links. 
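
As a minimal sketch of the simplest of these, random oversampling, the plain-Python function below duplicates randomly chosen minority rows until every class is as large as the biggest one. The function name and toy data are assumptions for illustration; random undersampling mirrors it by dropping majority rows instead.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority rows until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, lab in zip(rows, labels) if lab == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))  # sampled with replacement
            out_labels.append(cls)
    return out_rows, out_labels


# Toy dataset: six majority rows (class 0) and two minority rows (class 1).
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [5.0], [5.5]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Resampling like this should be applied to the training split only, so that the evaluation data keeps the original class distribution.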

An example of a resampling technique from general statistics is bootstrap resampling, where you generate new data samples by randomly selecting observations from the original dataset with replacement. These new samples are then used to estimate the variability of a statistic or to construct a confidence interval.  

For instance, if you have a dataset of 100 observations, you can draw 100 new samples of size 100 with replacement from the original dataset. Then, you can compute the mean of each new sample, resulting in 100 new mean values. By examining the distribution of these means, you can estimate the standard error of the mean or the confidence interval of the population mean. 
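
The bootstrap procedure just described can be sketched with the standard library alone. The dataset here is synthetic (drawn from a normal distribution with an assumed mean of 50 and standard deviation of 10), and the percentile interval is a rough one given only 100 resamples.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical dataset of 100 observations.
data = [rng.gauss(50, 10) for _ in range(100)]

# Draw 100 bootstrap samples of size 100, each with replacement,
# and record the mean of every sample.
boot_means = []
for _ in range(100):
    resample = rng.choices(data, k=len(data))
    boot_means.append(statistics.mean(resample))

# The spread of the bootstrap means estimates the standard error of the mean.
standard_error = statistics.stdev(boot_means)

# Rough 95% percentile interval from the sorted bootstrap means.
boot_means.sort()
ci_low, ci_high = boot_means[2], boot_means[97]

print(f"standard error: {standard_error:.2f}")
print(f"approximate 95% interval: ({ci_low:.2f}, {ci_high:.2f})")
```

In practice, a few thousand resamples are commonly used to make the interval endpoints more stable.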

2. Data augmentation

Data augmentation involves creating additional data points by modifying existing data. This can be done by applying various transformations such as rotations, translations, and flips to the existing data.
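
For image-like data, flips and rotations can be sketched without any libraries. The example below assumes an image represented as a nested list of pixel values, which is a simplification for illustration; real pipelines typically operate on arrays or tensors.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Reverse the order of the rows."""
    return [row[:] for row in image[::-1]]


def rotate_90_clockwise(image):
    """Rotate the image a quarter turn clockwise."""
    return [list(row) for row in zip(*image[::-1])]


# A tiny 2x3 "image" of pixel values (illustrative).
image = [
    [1, 2, 3],
    [4, 5, 6],
]

augmented = [
    flip_horizontal(image),
    flip_vertical(image),
    rotate_90_clockwise(image),
]
for variant in augmented:
    print(variant)
```

Each variant keeps the original label, so applying such transformations to minority-class examples adds new training points for that class.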


3. Synthetic minority over-sampling technique (SMOTE)

SMOTE is a type of oversampling technique that involves creating synthetic examples of the minority class by interpolating between existing minority class examples.
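
The interpolation step can be sketched as follows. This is a simplified, SMOTE-like illustration in plain Python, not a reference implementation: the function name, the choice of `k`, and the toy points are assumptions.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create synthetic points between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Four minority-class points with two features each (illustrative).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]

new_points = smote_like(minority, n_new=5)
for point in new_points:
    print([round(v, 3) for v in point])
```

Because every synthetic point lies on a line segment between two real minority points, the new examples stay inside the region the minority class already occupies.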

4. Ensemble techniques

Ensemble techniques involve combining multiple models to improve performance. This can be done by using techniques such as bagging, boosting, and stacking.
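
As a small illustration of the combining step, the sketch below takes a majority vote over the predictions of several classifiers, which is how bagging-style ensembles typically aggregate class labels. The three prediction lists are made-up outputs of hypothetical models.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model prediction lists by taking the most common label."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Predictions from three hypothetical models on the same four samples.
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]

combined = majority_vote([model_a, model_b, model_c])
print(combined)
```

With an odd number of binary classifiers there are no ties; other setups need an explicit tie-breaking rule.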

5. One-class classification

One-class classification involves training a model on only one class and then using it to identify data points that do not belong to that class. This can be useful for identifying anomalies and outliers in the data.
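
A very simple version of this idea is sketched below: the "model" is just the mean and standard deviation of a single class, and a point is flagged when it lies too many standard deviations away. The data, the function names, and the threshold of 3 are assumptions for illustration; dedicated one-class algorithms are considerably more flexible.

```python
import statistics


def fit_one_class(values):
    """Learn a description of the single known class: its mean and spread."""
    return statistics.mean(values), statistics.stdev(values)


def is_outlier(x, model, threshold=3.0):
    """Flag x if it lies more than `threshold` standard deviations away."""
    mean, std = model
    return abs(x - mean) / std > threshold


# Training data drawn from the one class we know about (illustrative).
normal = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
model = fit_one_class(normal)

for value in (10.1, 15.0):
    print(value, "outlier" if is_outlier(value, model) else "normal")
```

Only examples of one class are needed to fit the model, which is what makes the approach attractive when the other class is extremely rare.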

6. Cost-sensitive learning

Cost-sensitive learning involves adjusting the cost of misclassifying data points to account for the class imbalance. This can be done by assigning a higher cost to misclassifying the minority class, which encourages the model to prioritize correctly classifying the minority class.
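
One common way to set such costs is to weight each class inversely to its frequency. The sketch below computes those weights and uses them in a weighted error rate; the helper names and the toy labels are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Error rate in which each mistake costs the weight of its true class."""
    total = sum(weights[t] for t in y_true)
    wrong = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total


# 90 majority samples (class 0) and 10 minority samples (class 1).
y_true = [0] * 90 + [1] * 10
weights = balanced_class_weights(y_true)

# A model that always predicts the majority class.
y_pred = [0] * 100

plain_error = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"weights: {weights}")
print(f"plain error: {plain_error:.2f}")
print(f"weighted error: {weighted_error(y_true, y_pred, weights):.2f}")
```

The always-majority model looks good under the plain error rate but poor under the weighted one, which is the pressure that pushes a cost-sensitive learner toward the minority class.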

7. Evaluation metrics for imbalanced data

Evaluation metrics such as precision, recall, and F1 score can be used to evaluate the performance of models on imbalanced data. Additionally, metrics such as the area under the receiver operating characteristic curve (AUC-ROC) and the area under the precision-recall curve (AUC-PR) can also be used. 
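
The sketch below computes precision, recall, and F1 from scratch on a made-up imbalanced example, to show why accuracy alone can mislead. The labels and predictions are illustrative assumptions.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive (minority) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# 90 negatives and 10 positives; the model finds only 4 of the positives
# and raises 2 false alarms.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 88 + [1] * 2 + [1] * 4 + [0] * 6

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

print(f"accuracy={accuracy:.2f}")
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

Accuracy is high because the majority class dominates, while recall shows that most minority-class examples were missed.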

Choosing the best technique for handling imbalanced data 

We have now covered several approaches that can be used to handle imbalanced data. The most common techniques include undersampling, oversampling, and feature selection. 

Undersampling involves reducing the size of the majority class to match that of the minority class, while oversampling involves creating new instances of the minority class to balance the data. Feature selection is the process of selecting only the most relevant features to reduce the noise in the data.  

In conclusion, it is recommended to consider both undersampling and oversampling techniques to balance the data, with oversampling often being the more effective of the two. However, the choice of technique will ultimately depend on the specific characteristics of the dataset and the problem at hand. 

Boost your MLOps efficiency with these 6 must-have tools and platforms
Ayesha Saleem
| February 20, 2023

Are you struggling with managing MLOps tools? In this blog, we’ll show you how to boost your MLOps efficiency with 6 essential tools and platforms. These tools will help you streamline your machine learning workflow, reduce operational overheads, and improve team collaboration and communication.

Machine learning (ML) is a technology that automates tasks and provides insights by letting data scientists build predictive models. A range of tools and platforms is designed to make working with ML more efficient. Businesses across industries use it for a wide range of applications, including fraud prevention, marketing automation, customer service, chatbots, virtual assistants, and recommendations. Here are the best tools and platforms for MLOps professionals: 


Apache Spark 

Apache Spark is an in-memory distributed computing platform. It spreads work across a cluster of machines and can also run locally on a single machine. Spark is a general-purpose distributed data processing engine that can handle large volumes of data for applications like data analysis, fraud detection, and machine learning. It features an ML package (MLlib) with machine learning-specific APIs that enable the easy creation, training, and deployment of ML models.  

With Spark, you can build various applications including recommendation engines, fraud detection, and decision support systems. Spark has become the go-to platform for an impressive range of industries and use cases. It excels with large volumes of data in real time. It is open source and relatively easy to use. Spark is well suited to applications that involve large volumes of data, real-time computing, model optimization, and deployment.  


AWS SageMaker 

AWS SageMaker is an AI service that allows developers to build, train and manage AI models. SageMaker boosts machine learning model development with the power of AWS, including scalable computing, storage, networking, and pricing. It offers a complete end-to-end solution, including development tools, execution environments, training models, and deployment.  

AWS SageMaker provides managed services, including model management and lifecycle management using a centralized, debugged model. It also has a model marketplace for customers to choose from a range of models, including custom ones.  

AWS SageMaker also has a CLI for model creation and management. While the service is currently AWS-only, it supports both S3 and Glacier storage. AWS SageMaker is great for building quick models and is a good option for prototyping and testing. It is also useful for training models on smaller datasets. AWS SageMaker is useful for creating basic models, including regression, classification, and clustering. 

Best tools and platforms for MLOps – Data Science Dojo

Google Cloud Platform 

Google Cloud Platform is a comprehensive offering of cloud computing services. It offers a range of products, including Google Cloud Storage, Google Cloud Deployment Manager, Google Cloud Functions, and others.  

Google Cloud Platform is designed for building large-scale, mission-critical applications. It provides enterprise-class services and capabilities, such as on-demand infrastructure, network, and security. It also offers managed services, including managed storage and managed computing. Google Cloud Platform is a great option for businesses that need high-performance computing, such as data science, AI, machine learning, and financial services. 

Microsoft Azure Machine Learning 

Microsoft Azure Machine Learning is a set of tools for creating, managing, and analyzing models. It has prebuilt models that can be used for training and testing. Once a model is trained, it can be deployed as a web service. 

It also offers tools for creating models from scratch. Machine Learning is a set of techniques that allow computers to make predictions based on data without being programmed to do so. It uses algorithms to find patterns and make predictions based on the data, such as predicting what a user will click on.

Azure Machine Learning has a variety of prebuilt models, such as speech, language, image, and recommendation models. It also has tools for creating custom models. Azure Machine Learning is a great option for businesses that want to rapidly build and deploy predictive models. It is also well suited to model management, including deploying, updating, and managing models.  

Databricks 

Next up in the MLOps efficiency list, we have Databricks, a next-generation data management platform built on open-source technologies such as Apache Spark. It focuses on two aspects of data management: ETL (extract-transform-load) and data lifecycle management. It has built-in support for machine learning.  

It allows users to design data pipelines, such as extracting data from various sources, transforming that data, and loading it into data storage engines. It also has ML algorithms built into the platform. It provides a variety of tools for data engineering, including model training and deployment. It has built-in support for different machine-learning algorithms, such as classification and regression. Databricks is a good option for business users that want to use machine learning quickly and easily. It is also well suited to data engineering tasks, such as vectorization and model training. 

TensorFlow Extended (TFX) 

TensorFlow Extended (TFX) is an end-to-end platform for deploying production ML pipelines. It is built on TensorFlow, an open-source platform for implementing ML models. TensorFlow offers a wide range of ready-made models for various tasks, along with tools for designing and training models. It also has support for building custom models.  

TensorFlow offers a wide range of models for different tasks, such as speech and language processing, computer vision, and natural language understanding. It has support for a wide range of formats, including CSV, JSON, and HDFS.

TensorFlow also has a large library of machine learning models, such as neural networks, regression, probabilistic models, and collaborative filtering. It is a powerful tool for data scientists, and its ready-made models and large community make it an approachable and reliable choice.

Key Takeaways 

Machine learning is one of the most important technologies in modern businesses. It automates tasks, provides insights, and allows data scientists to build models for specific tasks. But finding the right tool and platform can be difficult: ML comes in many forms, with a range of tools and platforms designed to make working with it more efficient. We hope this list of the best tools and platforms for MLOps professionals helps you with your decisions. 

 

5 tips to develop successful machine learning projects
Kelly Moser
| January 25, 2023

Machine learning is the way of the future. Discover the importance of data collection, finding the right skill sets, performance evaluation, and security measures to optimize your next machine learning project. 


Top 10 Machine Learning demos of 2022 from Data Science Dojo
Ali Mohsin
| December 28, 2022

In this blog, we will look at the top 10 Machine Learning demos offered by Data Science Dojo, which make it easy to try ML (Machine Learning) techniques for free.  


Guest blog
| November 22, 2022

With the surge in demand and interest in AI and machine learning, many contemporary trends are emerging in this space. If you are a tech professional, this blog will give you an exciting look at what’s next in the realm of Artificial Intelligence and Machine Learning trends.

 

Emerging AI and machine learning trends

Data security and regulations 

In today’s economy, data is the main commodity. To rephrase, intellectual capital is the most precious asset that businesses must safeguard. The quantity of data they manage, as well as the hazards connected with it, is only going to expand after the emergence of AI and ML. Many companies now back up and archive large volumes of private information, which poses a growing privacy danger. (Don Evans, CEO of Crewe Foundation)


The future currency is data. In other words, it’s the most priceless resource that businesses must safeguard. The amount of data they handle, and the hazards attached to it, will only grow when AI and ML are brought into the mix. Today’s businesses, for instance, back up and store enormous volumes of sensitive customer data, which is expected to increase privacy risks by 2023.
 

Overlap of AI and IoT 

There is a blurring of boundaries between AI and the Internet of Things. While each technology has merits of its own, only when they are combined can they offer novel possibilities. Smart voice assistants like Alexa and Siri only exist because AI and the Internet of Things have come together. Why, therefore, do these two technologies complement one another so well?

The Internet of Things (IoT) is the digital nervous system, while Artificial Intelligence (AI) is the decision-making brain. AI’s speed at analyzing large amounts of data for patterns and trends improves the intelligence of IoT devices. As of now, just 10% of commercial IoT initiatives make use of AI, but that number is expected to climb to 80% by 2023. (Josh Thill, Founder of Thrive Engine)


IoT and AI can be compared to the nervous system and the brain of the digital world, respectively. IoT systems have become more sophisticated thanks to AI’s capacity to quickly extract insights from data. This development gives software developers and embedded engineers another reason to include AI/ML skills in their resumes. 

 

Augmented Intelligence   

The growth of augmented intelligence should be a relieving trend for individuals who may still be concerned about AI stealing their jobs. It combines the greatest traits of both people and technology, offering businesses the ability to raise the productivity and effectiveness of their staff.

40% of infrastructure and operations teams in big businesses will employ AI-enhanced automation by 2023, increasing efficiency. Naturally, for best results, their staff should be knowledgeable in data science and analytics or have access to training in the newest AI and ML technologies. 

We are moving on from the concept of Artificial Intelligence to Augmented Intelligence, where decision models blend artificial and human intelligence. AI finds, summarizes, and collates information from across the information landscape, for example, a company’s internal data sources. This information is presented to the human operator, who can make a human decision based on that information. This trend is supported by recent breakthroughs in Natural Language Processing (NLP) and Natural Language Understanding (NLU). (Kuba Misiorny, CTO of Untrite Ltd)
 

Transparency 

Although AI is increasingly commonplace, it still faces trust problems. Businesses will want to utilize AI systems more frequently, and they will want to do so with greater assurance. Nobody wants to put their trust in a system they don’t fully comprehend.

As a result, in 2023 there will be a stronger push for the deployment of AI in a visible and specified manner. Businesses will work to grasp how AI models and algorithms function, but AI/ML software providers will need to make complex ML solutions easier for consumers to understand.

The importance of experts who work in the trenches of programming and algorithm development will increase as transparency becomes a hot topic in the AI world. 

Composite AI 

Composite AI is a new approach that generates deeper insights from any content and data by fusing different AI technologies. Knowledge graphs are much more symbolic, explicitly modeling domain knowledge and, when combined with the statistical approach of ML, create a compelling proposition. Composite AI expands the quality and scope of AI applications and, as a result, is more accurate, faster, transparent, and understandable, and delivers better results to the user. (Dorian Selz, CEO of Squirro)

It’s a major advance in the evolution of AI and marrying content with context and intent allows organizations to get enormous value from the ever-increasing volume of enterprise data. Composite AI will be a major trend for 2023 and beyond. 

Continuous focus on healthcare

There has been concern that AI will eventually replace humans in the workforce ever since the concept was first proposed in the 1950s. In 2018, a deep learning algorithm was constructed that demonstrated accurate diagnosis utilizing a dataset consisting of more than 50,000 normal chest images and 7,000 scans that revealed active tuberculosis. Since then, I believe that the healthcare business has mostly made use of Machine Learning (ML) and Deep Learning applications of artificial intelligence. (Marie Ysais, Founder of Ysais Digital Marketing)


 

Pathology-assisted diagnosis, intelligent imaging, medical robotics, and the analysis of patient information are just a few of the many applications of artificial intelligence in the healthcare industry. Leading stakeholders in the healthcare industry have been presented with advancements and machine-learning models from some of the world’s largest technology companies. Next year, 2023, will be an important year to observe developments in the field of artificial intelligence.
 

Algorithmic decision-making 

Advanced algorithms are taking on the skills of human doctors, and while AI may increase productivity in the medical world, nothing can take the place of actual doctors. Even in robotic surgery, the whole procedure is physician-guided. AI is a good supplement to physician-led health care. The future of medicine will be high-tech with a human touch.  

 

No-code tools   

The low-code/No Code ML revolution accelerates creating a new breed of Citizen AI. These tools fuel mainstream ML adoption in businesses that were previously left out of the first ML wave (mostly taken advantage of by BigTech and other large institutions with even larger resources). Maya Mikhailov Founder of Savvi AI 

Low-code intelligent automation platforms allow business users to build sophisticated solutions that automate tasks, orchestrate workflows, and automate decisions. They offer easy-to-use, intuitive drag-and-drop interfaces, all without the need to write a line of code. As a result, low-code intelligent automation platforms are popular with tech-savvy business users, who no longer need to rely on professional programmers to design their business solutions. 

 

Cognitive analytics 

Cognitive analytics is another emerging trend that will continue to grow in popularity over the next few years. The ability for computers to analyze data in a way that humans can understand is something that has been around for a while now but is only recently becoming available in applications such as Google Analytics or Siri—and it’ll only get better from here! 

 

Virtual assistants 

Virtual assistants are another area where NLP is being used to enable more natural human-computer interaction. Virtual assistants like Amazon Alexa and Google Assistant are becoming increasingly common in homes and businesses. In 2023, we can expect to see them become even more widespread as they evolve and improve. Idrees Shafiq-Marketing Research Analyst at Astrill


Virtual assistants are becoming increasingly popular, thanks to their convenience and ability to provide personalized assistance. In 2023, we can expect to see even more people using virtual assistants, as they become more sophisticated and can handle a wider range of tasks. Additionally, we can expect to see businesses increasingly using virtual assistants for customer service, sales, and marketing tasks.
 

Information security (InfoSec)

The methods and tools that companies use to safeguard information fall under the category of information security. It comprises policy settings that are designed to prevent unauthorized access to, use, disclosure, disruption, modification, inspection, recording, or destruction of data.

With AI models that cover a broad range of areas, from network and security architecture to testing and auditing, AI-driven information security is a developing and expanding field. To safeguard sensitive data from potential cyberattacks, information security procedures are built on the three fundamental goals of confidentiality, integrity, and availability, known as the CIA triad. Daniel Foley, Founder of Daniel Foley SEO 

 

Wearable devices 

The wearable market continues to grow. Wearable devices, such as fitness trackers and smartwatches, are becoming more popular as they become more affordable and functional. These devices collect data that can be used by AI applications to provide insights into user behavior. Oberon, Founder, and CEO of Very Informed 

 

Process discovery

Process discovery can be characterized as a combination of tools and methods that rely heavily on artificial intelligence (AI) and machine learning to assess the performance of the people participating in a business process. Compared with earlier versions of process mining, it goes further in figuring out what occurs when individuals interact in different ways with various objects to produce business process events.

The methodologies and AI models vary widely, from clicks of the mouse for specific reasons to opening files, papers, web pages, and so forth. All of this necessitates various information transformation techniques. The automated procedure using AI models is intended to increase the effectiveness of commercial procedures. Salim Benadel, Director at Storm Internet

 

Robotic Process Automation (RPA)

An emerging tech trend that will start becoming more popular is Robotic Process Automation, or RPA. It is similar to AI and machine learning, and it is used for specific types of job automation. Right now, it is primarily used for things like data handling, dealing with transactions, processing and interpreting job applications, and automated email responses. It makes many business processes much faster and more efficient, and as time goes on, more processes will be taken over by RPA. Maria Britton, CEO of Trade Show Labs 

Robotic process automation is an application of artificial intelligence that configures a robot (software application) to interpret, communicate and analyze data. This form of artificial intelligence helps to automate partially or fully manual operations that are repetitive and rule based. Percy Grunwald, Co-Founder of Hosting Data 

 

Generative AI 

Most individuals say AI is good for automating routine, repetitive work. However, AI technologies and applications are also being developed to replicate creativity, one of the most distinctive human skills. Generative AI algorithms leverage existing data (video, photos, sounds, or computer code) to create new material.

Deepfake videos and the Metaphysic act on America’s Got Talent have popularized the technology. In 2023, organizations will increasingly employ it to produce synthetic data. Synthetic audio and video data can eliminate the need to record footage and speech. Simply write what you want the audience to see and hear, and the AI creates it. Leonidas Sfyris 

With the rise of personalization in video games, new content has become increasingly important. Companies are not able to hire enough artists to constantly create new themes for all the different characters, so the ability to enter a concept, such as a cowboy, and have the art assets generated for all their characters becomes a powerful tool. 

 

Observability in practice

By delving deeply into contemporary networked systems, Applied Observability facilitates the discovery and resolution of issues more quickly and automatically. Applied observability is a method for keeping tabs on the health of a sophisticated structure by collecting and analyzing data in real time to identify and fix problems as soon as they arise.

Observability can be used for application monitoring and debugging. Telemetry data, including logs, metrics, traces, and dependencies, is collected and then correlated in real time to provide responders with full context for the incidents they are called to. Automation, machine learning, and artificial intelligence (AIOps) might be used to eliminate the need for human interaction in problem-solving. Jason Wise, Chief Editor at Earthweb 

 

Natural Language Processing 

As more and more business processes are conducted through digital channels, including social media, e-commerce, customer service, and chatbots, NLP will become increasingly important for understanding user intent and producing the appropriate response.
 

Read more about NLP tasks and techniques in this blog:

Natural Language Processing – Tasks and techniques

 

In 2023, we can expect to see increased use of Natural Language Processing (NLP) for communication and data analysis. NLP has already seen widespread adoption in customer service chatbots, but it may also be utilized for data analysis, such as extracting information from unstructured texts or analyzing sentiment in large sets of customer reviews. Additionally, deep learning algorithms have already shown great promise in areas such as image recognition and autonomous vehicles.

In the coming years, we can expect to see these algorithms applied to various industries such as healthcare for medical imaging analysis and finance for stock market prediction. Lastly, the integration of AI tools into various industries will continue to bring about both exciting opportunities and ethical considerations. Nicole Pav, AI Expert.  

 

Do you know any other AI and Machine Learning trends?

Share with us in comments if you know about any other trending or upcoming AI and machine learning.

 

Guest blog
| November 15, 2022

In this blog, we have gathered the top 10 machine learning books. Learning this subject is a challenge for beginners. Take your learning experience one step ahead with these top-rated ML books on Amazon. 

Top 10 Machine learning books
Top 10 Machine learning books – Data Science dojo

1. Machine Learning: 4 Books in 1

Machine learning - 4 books in 1
Machine learning – 4 books in 1 by Samuel Hack

Machine Learning: 4 Books in 1 is a complete guide for beginners to master the basics of Python programming and understand how to build artificial intelligence through data science. This book includes four books: Introduction to Machine Learning, Python Programming for Beginners, Data Science for Beginners, and Artificial Intelligence for Beginners. It covers everything you need to know about machine learning, including supervised and unsupervised learning, regression and classification, feature engineering, model selection, and more. Muhammad Junaid – Marketing manager, BTIP

With clear explanations and practical examples, this book will help you quickly learn the essentials of machine learning and start building your own AI applications.

2. Mathematics for Machine Learning

Mathematics for machine learning

Mathematics for Machine Learning is a tool that helps you understand the mathematical foundations of machine learning, so that you can build better models and algorithms. It covers topics such as linear algebra, probability, optimization, and statistics. With this book, you will be able to learn the mathematics needed to develop machine learning models and algorithms. Daniel – Founder, Gadget FAQs

This book is excellent for brushing up on the mathematics required for ML. It is very concise while still providing enough detail to help readers identify the important parts. This is the go-to if you need to review some concepts or brush up on your knowledge in general.

This book is not recommended if you have absolutely no prior math experience, though, as it can be hard to digest and the authors sometimes skip steps in proofs and examples. In the probability section especially, the concepts will be very hard to grasp without prior knowledge.

3. Linear Algebra and Optimization for Machine Learning

Linear algebra for Machine learning

This textbook provides a comprehensive introduction to linear algebra and optimization, two fundamental topics in machine learning. It covers both theory and applications and is suitable for students with little or no background in mathematics. Allan McNabb, VP – Image Building Media

The book begins with a review of basic linear algebra, before moving on to more advanced topics such as matrix decompositions, eigenvalues and eigenvectors, singular value decomposition, and least squares methods. Optimization techniques are then introduced, including gradient descent, Newton’s Method, conjugate gradient methods, and interior point methods.

4. The Hundred-Page Machine Learning Book

hundred-page machine learning
Hundred page machine learning book

If we had to teach machine learning to someone in just a few weeks, it would be far better not to start from scratch and instead hand this book to the learner, because Andriy Burkov no doubt does a better job than we could of teaching this vast subject in a limited time.

The book has a litany of rave reviews from some of the biggest names in tech, with scores more five-star reviews to boot, and you can see why. Burkov keeps his lessons concise and as easy to understand as possible given the subject matter, but still drills down into the details where necessary. Overall, the book excels at linking together complicated and sometimes seemingly unrelated concepts into a coherent whole. Peter, CEO and founder – Lantech

The book is very well organized, giving the reader an introduction and discussion on the mathematical notation used, a well written chapter that discusses several quite common algorithms, talks about best practices (like feature engineering, breaking up the data into multiple sets, and tuning the model’s hyperparameters), digs deeper into supervised learning, discusses unsupervised learning, and gives you a taste of a variety of other related topics.

This is a well-rounded book, far more so than most books I’ve read on machine learning or artificial intelligence. After reading through this, you will feel like you can competently discuss the subject, read one of the simpler machine learning research papers, and not be totally lost on the mathematics involved. The language used is concise and reads very well, showing very tight editing.

5. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron

hands-on machine learning book
Hands-on machine learning book

It’s good for new programmers without over-simplifying. I’d recommend it for really getting into practice exercises. It’s a book you need to take your time with, but you’ll learn a lot from it. One thing observed by the learners of this book as a con is that the quality of the print varies, but the quality of its content makes it more than worth it. Chris Martinez – Founder of Idiomatic

6. Machine Learning for Absolute Beginners by Oliver Theobald

Machine learning for beginners
Machine learning for beginners by Oliver Theobald

Machine Learning is easy only when you have the right teacher and an appropriate reference book. Most of us fail to understand the importance of simple concepts that help us understand complex ones. Therefore, I recommend using Oliver Theobald’s Machine Learning for Absolute Beginners as the base reference book. Layla Acharya – Owner at Edwize

This book uses simple language and teaches machine learning from scratch. Although non-technical people will find this book more relatable, people wanting to make a career in the machine learning field can benefit equally. It also has good references that can help a person who wants to learn like an expert.

7. Deep Learning for Coders with Fastai and PyTorch: AI Applications Without a PhD by Jeremy Howard and Sylvain Gugger

Deep learning for coders
Deep learning for coders with fastai and PyTorch

This book is very well-rated and it’s helped me a lot in understanding the basics of deep learning.

The main reason readers suggest this book is because it’s very accessible and easy to follow. As the authors themselves say, you don’t need a PhD to understand and use the concepts in the book, and it follows a top-down approach (starting with the applications and working backwards to the theory). So, you’ll first have fun with building cool applications and then gradually learn the underlying theory as you go. Ed Shway – Owner & Writer at ByteXD.com

Fast AI have kept updating their courses and library, so you might want to check out their website (https://www.fast.ai/) for the latest and greatest. Just this July they released a new version of the course that the book is associated with (https://course.fast.ai/).

Furthermore, the book also comes in a free online version: https://github.com/fastai/fastbook. Since the Fast AI team put in all this effort and made every resource available for free, you can be sure they’re in it for the love of the game and to help the community, rather than to make a quick buck. So, this book is definitely worth your time.

The first practical application it teaches you is in computer vision: you’ll build an image classifier, which you can use to tell apart different kinds of images. For example, you can use it to distinguish between different kinds of animals. It will be very easy to follow along and build this classifier yourself.

 

8. Bayesian Reasoning and Machine Learning by David Barber

Bayesian reasoning and machine learning book

It’s a real must-have for beginners interested in deepening their knowledge of machine learning in an engaging way. The book covers topics such as dynamic and probabilistic models, approximate inference, graphical models, Naive Bayes algorithms, and more. What makes it worth checking out is the fact that the book is full of examples and exercises, which makes it a hands-on guide full of useful practice rather than dry theoretical frameworks. Marcin Gwizdala – Chief Technical Officer – Tidio

For relative beginners, Bayesian techniques began in the 1700s to model how a degree of belief should be modified to account for new evidence. The techniques and formulas were largely discounted and ignored until the modern era of computing, pattern recognition and AI, now machine learning.

The formula answers how the probabilities of two events are related when represented inversely, and more broadly, gives a precise mathematical model for the inference process itself (under uncertainty), where deductive reasoning and logic become a subset (under certainty, or when values can resolve to 0/1, true/false, yes/no, etc.). In “odds” terms (useful in many fields, including optimal expected utility functions in decision theory), posterior odds = prior odds × the Bayes factor.
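To make the odds form concrete, here is a minimal sketch of the update. The function names and the numbers (a 1% prior and two likelihoods) are hypothetical and chosen only for illustration; they do not come from the book.

```python
def posterior_odds(prior_probability, likelihood_if_true, likelihood_if_false):
    """Posterior odds = prior odds * Bayes factor."""
    prior_odds = prior_probability / (1 - prior_probability)
    # The Bayes factor compares how well each hypothesis explains the evidence.
    bayes_factor = likelihood_if_true / likelihood_if_false
    return prior_odds * bayes_factor


def odds_to_probability(odds):
    """Convert odds back into a probability."""
    return odds / (1 + odds)


# Hypothetical numbers: the hypothesis starts at 1% probability, and the
# evidence is 10 times more likely if the hypothesis is true than if it is false.
odds = posterior_odds(0.01, likelihood_if_true=0.9, likelihood_if_false=0.09)
probability = odds_to_probability(odds)
print(round(probability, 4))
```

Even with evidence that favors the hypothesis by a factor of 10, the posterior probability stays below 10% because the prior was so low.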

9. Deep Learning with PyTorch: Build, train, and tune neural networks using Python tools by Eli Stevens, Luca Antiga, Thomas Viehmann

Deep learning with Pytorch

This book provides a good and fairly complete description of the basic principles and abstractions of one of the most popular frameworks for Machine Learning – PyTorch.

It’s great that this book is written by the creator and key contributors of PyTorch. Unlike many books that claim to be a definitive treatise, it is not overloaded with non-essential details; the emphasis is on making the book practical. The book gives the reader a deep understanding of the framework and of methods for building and training models with it (including advanced best practices), describing what is under the hood. Vitalii Kudelia, TUTU – Machine Learning Scientist

The book includes an example of solving a real-world problem: it analyzes the problem of searching for malignant tumors in CT scans, with an analysis of approaches, possible errors, and options for improvement, and it provides code examples.

It also includes options for moving the model into production, using the models in other programming languages, and running them on mobile devices. As a result, the book is highly useful for understanding and mastering the framework. Mastering PyTorch helps not only in computer vision, but also in other areas of deep learning, such as natural language processing.

10. Introduction to Machine Learning by Ethem Alpaydin

Intro to machine learning
Intro to machine learning book by Ethem Alpaydin

This comprehensive text covers everything from the basics of linear algebra to more advanced topics like support vector machines. In addition to being an excellent resource for students, Alpaydin’s book is also very accessible for practitioners who want to learn more about this exciting field. Rajesh Namase – Co-Founder and Tech Blogger

For learners, this is the best book for machine learning for a number of reasons. First, the book provides a clear and concise introduction to the basics of machine learning. Second, it covers a wide range of topics in machine learning, including supervised and unsupervised learning, feature selection, and model selection.

Third, the book is well-written and easy to understand. Finally, the book includes exercises and solutions at the end of each chapter, which is extremely helpful for readers who want to learn more about machine learning.

 

Share more machine learning books with us 

If you have read any other interesting machine learning books, share them with us in the comments below and help other learners get started with machine learning. 

Data Science vs AI – What does 2023 demand?
Lafond Wanda
| November 10, 2022

Most people have heard the terms “data science” and “AI” at least once in their lives. Indeed, both of these are extremely important in the modern world as they are technologies that help us run quite a few of our industries. 

But even though data science and Artificial Intelligence are somewhat related to one another, they are still very different. There are things they have in common which is why they are often used together, but it is crucial to understand their differences as well. 

What is Data Science? 

As the name suggests, data science is a field that involves studying and processing data in big quantities using a variety of technologies and techniques to detect patterns, make conclusions about the data, and help in the decision-making process. Essentially, it is an intersection of statistics and computer science largely used in business and different industries. 

Artificial Intelligence (AI) vs Data science vs Machine learning
Artificial Intelligence vs Data science vs Machine learning – Image source

The standard data science lifecycle includes capturing data and then maintaining, processing, and analyzing it before finally communicating conclusions about it through reporting. This makes data science extremely important for analysis, prediction, decision-making, problem-solving, and many other purposes. 

What is Artificial Intelligence? 

Artificial Intelligence is the field that involves the simulation of human intelligence and the processes within it by machines and computer systems. Today, it is used in a wide variety of industries and allows our society to function as it currently does by using different AI-based technologies. 

Some of the most common examples in action include machine learning, speech recognition, and search engine algorithms. While AI technologies are rapidly developing, there is still a lot of room for their growth and improvement. For instance, there is no powerful enough content generation tool that can write texts that are as good as those written by humans. Therefore, it is always preferred to hire an experienced writer to maintain the quality of work.  

What is Machine Learning? 

As mentioned above, machine learning is a type of AI-based technology that uses data to “learn” and improve specific tasks that a machine or system is programmed to perform. Though machine learning is seen as a part of the greater field of AI, its use of data puts it firmly at the intersection of data science and AI. 

Similarities between Data Science and AI 

By far the most important point of connection between data science and Artificial Intelligence is data. Without data, neither of the two fields would exist and the technologies within them would not be used so widely in all kinds of industries. In many cases, data scientists and AI specialists work together to create new technologies or improve old ones and find better ways to handle data. 

As explained earlier, there is a lot of room for improvement when it comes to AI technologies. The same can be somewhat said about data science. That’s one of the reasons businesses still hire professionals to accomplish certain tasks like custom writing requirements, design requirements, and other administrative work.  

Differences between Data Science and AI 

There are quite a few differences between the two. These include: 

  • Purpose – Data science aims to analyze data to make conclusions, predictions, and decisions. Artificial Intelligence aims to enable computers and programs to perform complex processes in a similar way to how humans do. 
  • Scope – Data science includes a variety of data-related operations such as data mining, cleansing, reporting, etc. Artificial Intelligence primarily focuses on machine learning, but there are other technologies involved too, such as robotics, neural networks, etc. 
  • Application – Both are used in almost every aspect of our lives, but while data science is predominantly present in business, marketing, and advertising, AI is used in automation, transport, manufacturing, and healthcare. 

Examples of Data Science and Artificial Intelligence in use 

To give you an even better idea of what data science and Artificial Intelligence are used for, here are some of the most interesting examples of their application in practice: 

  • Analytics – Analyze customers to better understand the target audience and offer the kind of product or service that the audience is looking for. 
  • Monitoring – Monitor the social media activity of specific types of users and analyze their behavior. 
  • Prediction – Analyze the market and predict demand for specific products or services in the near future. 
  • Recommendation – Recommend products and services to customers based on their customer profiles, buying behavior, etc. 
  • Forecasting – Predict the weather based on a variety of factors and then use these predictions for better decision-making in the agricultural sector. 
  • Communication – Provide high-quality customer service and support with the help of chatbots. 
  • Automation – Automate processes in all kinds of industries from retail and manufacturing to email marketing and pop-up on-site optimization. 
  • Diagnosing – Identify and predict diseases, give correct diagnoses, and personalize healthcare recommendations. 
  • Transportation – Use self-driving cars to get where you need to go. Use self-navigating maps to travel. 
  • Assistance – Get assistance from smart voice assistants that can schedule appointments, search for information online, make calls, play music, and more. 
  • Filtering – Identify spam emails and automatically get them filtered into the spam folder. 
  • Cleaning – Get your home cleaned by a smart vacuum cleaner that moves around on its own and cleans the floor for you. 
  • Editing – Check texts for plagiarism and proofread and edit them by detecting grammatical, spelling, punctuation, and other linguistic mistakes. 

It is not always easy to tell which of these examples is about data science and which one is about Artificial Intelligence because many of these applications use both of them. This way, it becomes even clearer just how much overlap there is between these two fields and the technologies that come from them. 

What is your choice?

At the end of the day, data science and AI remain some of the most important technologies in our society and will likely help us invent more things and progress further. As a regular citizen, understanding the similarities and differences between the two will help you better understand how data science and Artificial Intelligence are used in almost all spheres of our lives. 

Top 8 Machine Learning algorithms explained in less than 1 minute each  
Albar Wahab
| October 25, 2022

In this blog, we will discuss the top 8 Machine Learning algorithms that will help you to receive and analyze input data to predict output values within an acceptable range.

Machine learning algorithms
Top 8 machine learning algorithms explained

1. Linear Regression 

Linear regression
Linear regression – Machine learning algorithm – Data Science Dojo

Linear regression is a simple machine learning model and chances are you are already aware of it! Do you remember plotting the line y=mx+c in your introductory algebra class? This is an equation of a straight line where m is its gradient and c is the point where the line crosses the y-axis. Using this equation, you’re able to estimate the value of y for any given value of x. Similarly, linear regression involves estimating the relationship between independent variables (x) and a dependent variable(y).  
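As a minimal sketch of the idea, the snippet below estimates m and c from a handful of points using the ordinary least squares formulas for a single feature. The function name `fit_line` and the sample data are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (m, c) in y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    c = mean_y - m * mean_x
    return m, c


# Illustrative data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m, c = fit_line(xs, ys)
prediction = m * 5.0 + c  # estimate y for a new x
print(m, c, prediction)
```

With real, noisy data the points will not sit exactly on a line, and the same formulas return the line that minimizes the squared vertical distances to the points.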

 

2. Logistic Regression 

Logistic regression
Logistic regression – Machine learning algorithm – Data Science Dojo

Just like linear regression, logistic regression is a machine learning model used to determine the relationship between a dependent variable and one or more independent variables. However, this model is used for classification, because logistic regression predicts the probability of an event occurring. If the probability is greater than 0.5, a value of 1 is assigned; otherwise, a value of 0 is assigned. For example, you can use logistic regression to predict whether a student will pass (1) or fail (0) an exam. 
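The pass/fail example can be sketched as follows. The weight and bias are hypothetical values standing in for parameters that a real model would learn from data; only the sigmoid and the 0.5 threshold reflect the description above.

```python
import math


def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_pass(hours_studied, weight=1.5, bias=-6.0, threshold=0.5):
    """Return (probability of passing, predicted class).

    weight and bias are made-up values for illustration, not fitted ones.
    """
    probability = sigmoid(weight * hours_studied + bias)
    label = 1 if probability > threshold else 0
    return probability, label


for hours in (2, 6):
    probability, label = predict_pass(hours)
    print(hours, round(probability, 3), label)
```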

 

3. Decision Trees 

Decision tree
Decision tree – Machine learning algorithm – Data Science Dojo

Decision tree is a supervised machine learning model that repeatedly splits the data based on a question corresponding to the features. The model learns the best way to reduce randomness and drafts a decision tree that can be used to predict the category of an item based on answering a selection of questions. For example, in the case of whether it will rain today or not, the questions can be whether it is sunny, did it rain yesterday, whether it is windy, and so on.  
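One common way to measure the "randomness" a split removes is entropy, and the reduction is called information gain. The sketch below assumes that measure (other criteria, such as Gini impurity, are also used), and the weather rows are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the data on one feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


# Made-up observations: did it rain, given two yes/no questions?
rows = [
    {"sunny": True, "windy": False},
    {"sunny": True, "windy": True},
    {"sunny": False, "windy": False},
    {"sunny": False, "windy": True},
]
labels = ["no", "no", "yes", "yes"]

gains = {feature: information_gain(rows, labels, feature) for feature in ("sunny", "windy")}
best_question = max(gains, key=gains.get)
print(gains, best_question)
```

A tree builder would ask the question with the highest gain first and then repeat the process on each resulting subset.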

 

4. Random Forest 

Random forest
Random forest – Machine learning algorithm – Data Science Dojo

Random Forest is a machine learning algorithm that works similarly to a decision tree. The difference is that random forest uses multiple decision trees to make a prediction and hence decreases overfitting. The process of majority voting is carried out and the class selected by most trees is assigned to an item. For example, if two trees predict it to be 0, and one tree predicts it to be 1, then the class of 0 will be assigned to the item.  
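The voting step on its own can be sketched in a few lines. This shows only how the trees' predictions are combined; training the individual trees is omitted, and the function name is illustrative.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Two trees predict class 0 and one tree predicts class 1.
forest_prediction = majority_vote([0, 0, 1])
print(forest_prediction)
```

Using an odd number of trees avoids ties in two-class problems; in this sketch a tie would be resolved in favor of the class that appears first in the list.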

5. K-Nearest Neighbor 

K-nearest neighbour
K-nearest neighbor – Machine learning algorithm – Data Science Dojo

K-Nearest Neighbor is another simple machine learning algorithm that classifies new cases based on the category/class of the data points nearest to the new data point. That is, if most neighbors of an unknown item belong to class 1, then we assign class 1 to this unknown item. The number of neighbors to take into consideration is the value K assigned. If k=10, we will look at the 10 nearest neighbors of this item. The nearest neighbors are determined by measuring the distance using distance measures such as Euclidean distance, and the nearest are those that have the shortest distance. 
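A minimal sketch of this procedure, using Euclidean distance and a simple majority vote, might look like the following. The points, labels, and function name are invented for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Classify query by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Two small groups of 2-D points: class 1 near the origin, class 0 farther away.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = [1, 1, 1, 0, 0, 0]

print(knn_predict(train_points, train_labels, query=(2, 2), k=3))
print(knn_predict(train_points, train_labels, query=(6, 5), k=3))
```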

 

6. Support Vector Machine 

Support vector machine
Support vector machine – Machine learning algorithm – Data Science Dojo

Support vector machines classify data by dividing the data points using a hyperplane, which in two dimensions is a straight line. The points denoted by the blue diamonds form one class on the left side of the hyperplane, and the points denoted by the green circles represent another class on the right side. If we want to predict the class of a new point, we can simply determine whether it lies on the left or right side of the hyperplane and whether it falls within the margin. 
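The prediction step can be sketched as below. The weights and bias are hypothetical and describe the vertical line x = 3; a real support vector machine would learn them from training data, which this sketch does not attempt. The margin check assumes the usual scaling in which the margin boundaries sit at decision values of -1 and +1.

```python
def decision_value(weights, bias, point):
    """Signed score w . x + b; its sign tells which side of the hyperplane the point is on."""
    return sum(w * x for w, x in zip(weights, point)) + bias


def classify(weights, bias, point):
    """Return the side of the hyperplane and whether the point lies inside the margin."""
    value = decision_value(weights, bias, point)
    side = "right" if value >= 0 else "left"
    inside_margin = abs(value) < 1
    return side, inside_margin


# Hypothetical hyperplane: the vertical line x = 3 in two dimensions.
weights, bias = (1.0, 0.0), -3.0

print(classify(weights, bias, (1.0, 2.0)))
print(classify(weights, bias, (5.0, 1.0)))
print(classify(weights, bias, (3.5, 0.0)))
```

Points that fall inside the margin are still assigned a side, but the prediction is less certain because they are close to the boundary.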

7. K-Means clustering 

k-means clustering
K-means clustering – Machine learning algorithm

K-means clustering is an unsupervised machine learning algorithm. That means it is used to work with data points whose class is not already known. We can use the clustering algorithm to group similar items into clusters. The number of clusters is determined by the value of K assigned. For example, suppose you assign K=3. Three cluster centers are selected at random, and we adjust them until the clusters are highly distinct from one another. Points within a cluster will be similar to each other but distinct from points in another cluster.
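The adjust-until-stable loop can be sketched as follows. This is a bare-bones version of the standard assign-then-update iteration; the function name, the `initial_centers` parameter, and the sample points are assumptions made for illustration. By default the starting centers are drawn at random, but the example passes them in so the result is reproducible.

```python
import math
import random


def k_means(points, k, initial_centers=None, max_iterations=100, seed=0):
    """Group points into k clusters; returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = list(initial_centers) if initial_centers else rng.sample(points, k)
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centers[i]))
            clusters[nearest].append(point)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # nothing moved, so the clusters are stable
        centers = new_centers
    return centers, clusters


# Three well-separated groups of made-up 2-D points.
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (0, 10), (1, 10), (0, 11),
]
centers, clusters = k_means(points, k=3, initial_centers=[(0, 0), (10, 10), (0, 10)])
print(centers)
```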

8. Naïve Bayes

Naive Bayes classifier
Naive Bayes classifier – Machine learning algorithm – Data Science Dojo

Naïve Bayes is a probabilistic machine learning model based on the Bayes theorem that assumes that all the features are independent of one another. Conditional probability refers to the probability of an outcome occurring if it is given that another event has occurred. This algorithm predicts the probability that an item belongs to a particular class and is assigned the class with the highest probability. 
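As a minimal sketch, the snippet below multiplies a class prior by the per-feature likelihoods (the independence assumption) and normalizes the result. The spam/ham setup and all probabilities are invented for illustration; a real model would estimate them from training data.

```python
def naive_bayes_posteriors(priors, likelihoods, observed_features):
    """Return P(class | features), treating the features as independent."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature in observed_features:
            # Independence assumption: multiply the per-feature likelihoods.
            score *= likelihoods[label][feature]
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Hypothetical probabilities for a tiny spam filter.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.5},
}

posteriors = naive_bayes_posteriors(priors, likelihoods, ["free"])
predicted = max(posteriors, key=posteriors.get)
print(posteriors, predicted)
```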

Share more Machine Learning algorithms with us

Have we missed any Machine Learning algorithm that you would like to learn about? Share with us in the comments below.

 

Hilarious Data Science Jokes
Data Science Dojo Staff
| September 21, 2022

Having fun while learning data science is the missing ingredient for diligent data scientists. This blog post collects the best data science jokes, covering statistics, artificial intelligence, and machine learning.

 

Data Science jokes

 

For Data Scientists

1. There are two kinds of data scientists. 1.) Those who can extrapolate from incomplete data.

2. Data science is 80% preparing data, and 20% complaining about preparing data.

3. There are 10 kinds of people in this world. Those who understand binary and those who don’t.

4. What’s the difference between an introverted data analyst & an extroverted one? Answer: the extrovert stares at YOUR shoes.

5. Why did the chicken cross the road? The answer is trivial and is left as an exercise for the reader.

6. The data science motto: If at first, you don’t succeed; call it version 1.0

7. What do you get when you cross a pirate with a data scientist? Answer: Someone who specializes in Rrrr

8. A SQL query walks into a bar, walks up to two tables, and asks, “Can I join you?”

9. Why should you take a data scientist with you into the jungle? Answer: They can take care of Python problems

10. Old data analysts never die – they just get broken down by age

11. I don’t know any programming, but I still use Excel in my field!

12. Data is like people – interrogate it hard enough and it will tell you whatever you want to hear.

13. Don’t get it? We can help. Check out our in-person data science Bootcamp or online data science certificate program.

 

For Statisticians

14. Statistics may be dull, but it has its moments.

15. You are so mean that your standard deviation is zero.

16. How did the random variable get into the club? By showing a fake i.d.

17. Did you hear the one about the statistician? Probably….

18. Three statisticians went out hunting and came across a large deer. The first statistician fired, but missed, by a meter to the left. The second statistician fired, but also missed, by a meter to the right. The third statistician didn’t fire, but shouted in triumph, “On average we got it!”

19. Two random variables were talking in a bar. They thought they were being discreet, but I heard their chatter continuously.

20. Statisticians love whoever they spend the most time with; that’s their statistically significant other.

21. Old age is statistically good for you – very few people die past the age of 100.

22. Statistics prove offspring’s an inherited trait. If your parents didn’t have kids, odds are you won’t either.

 

For Artificial Intelligence experts

23. Artificial intelligence is no match for natural stupidity

24. Do neural networks dream of strictly convex sheep?

25. What did one support vector say to another support-vector? Answer: I feel so marginalized

26. AI blogs are like philosophy majors. They’re always trying to explain “deep learning.”

27. How many support vectors does it take to change a light bulb? Answer: Very few, but they must be careful not to shatter* it.

28. Parent: If all your friends jumped off a bridge, would you follow them? Machine Learning Algorithm: yes.

29. They call me Dirichlet because all my potential is latent and awaiting allocation

30. Batch algorithms: YOLO (You Only Learn Once), Online algorithms: Keep Updates and Carry On

31. “This new display can recognize speech” “What?” “This nudist play can wreck a nice beach”

32. Why did the naive Bayesian suddenly feel patriotic when he heard fireworks? Answer: He assumed independence

33. Why did the programmer quit their job? Answer: Because they didn’t get arrays.

34. What do you call a program that identifies spa treatments? Facial recognition!

35. Human: What do we want!?

  • Computer: Natural language processing!
  • Human: When do we want it!?
  • Computer: When do we want what?

 

36. A statistician’s wife had twins. He was delighted. He rang the minister who was also delighted. “Bring them to church on Sunday and we’ll baptize them,” said the minister. “No,” replied the statistician. “Baptize one. We’ll keep the other as a control.”

 

For Machine Learning professionals

37. I have a joke about a data miner, but you probably won’t dig it. @KDnuggets:

38. I have a joke about deep learning, but I can’t explain it. Shamail Saeed, @hacklavya

39. I have a joke about deep learning, but it is shallow. Mehmet Suzen, @memosisland

40. I have a machine learning joke, but it is not performing as well on a new audience. @dbredesen

41. I have a new joke about Bayesian inference, but you’d probably like the prior more. @pauljmey

42. I have a joke about Markov models, but it’s hidden somewhere. @AmeyKUMAR1

43. I have a statistics joke, but it’s not significant. @micheleveldsman

44. I have a geography joke, but I don’t know where it is. @olimould

45. I have an object-oriented programming joke. But it has no class. Ayin Vala

46. I have a quantum mechanics joke. It’s both funny and not funny at the same time. Philip Welch

47. I have a good Bayesian laugh that came from a prior joke. Nikhil Kumar Mishra

48. I have a java joke, but it is too verbose! Avneesh Sharma

49. I have a regression joke, but it sounds quite mean. Gang Su

50. I have a machine learning joke, but I cannot explain it. Andriy Burkov

 

Did we miss your favorite Data Science joke?

Share your favorite data science jokes with us in the comments below. Let’s laugh together!

Apache Zeppelin: Magnum Opus of MLOps
Saad Shaikh
| September 20, 2022

Data Science Dojo is offering Apache Zeppelin for FREE on Azure Marketplace packaged with pre-installed interpreters and backends to make Machine Learning easier than ever. 

Introduction 

How cumbersome and tiring it is to install different tools to perform your desired ML tasks and then look after the integration and dependency issues. Already getting headaches? Worry not, because Data Science Dojo’s Apache Zeppelin instance fixes all of that. But before we delve further into it, let’s get to know some basics. 

 

What are Machine Learning Operations?  

Machine Learning is a branch of Artificial Intelligence that deals with models that produce outcomes based on patterns learned from pre-existing data. It provides automation and reduces the workload of users. ML converges with Data Science and Engineering, and that gives rise to a set of operations that must be performed to obtain the output of any task.

These operations include ETL (Extract, Transform, Load) or ELT, drawing interactive visualizations, running queries, training and testing ML models, and several other functions. 

Pro Tip: Join our 6-month instructor-led Data Science Bootcamp to master machine learning skills. 

 

Challenges for individuals 

Wanting to explore and visualize your data without knowing how a new tool works is not only a red flag; it also demands that you learn extra skills before you can proceed with your job. Alternatively, you would have to switch among different environments to achieve your goal, which is again time-consuming, and needless to say, time is of the essence for data scientists and engineers when they must deliver a task.

In this scenario, switching from one tool to another, which you may or may not know how to use, is time- and cost-intensive. What if you were given a data-driven interactive environment with several interpreters ready to use in one place, so that you could just select your favorite language and get started? 

 

ML Operations with Apache Zeppelin 

Apache Zeppelin is an open-source tool that equips you with a web-based notebook that can be used for data processing and querying, handling big data, training and testing models, interactive data analytics, visualization, and exploration. The vibrant charts and visuals it generates can save users time in identifying key patterns in data and ultimately accelerate decision-making.

It contains different pre-installed interpreters but also allows you to plug in your own language backends as needed. Apache Zeppelin supports many data sources, which lets you combine your data and visualize it in interactive plots and charts. You can also create dynamic forms in your notebook and share your notebook with collaborators.

Apache Zeppelin
Apache Zeppelin Data Science Dojo

          

(Picture Courtesy: https://zeppelin.apache.org/ ) 

 

Key features 

  • Zeppelin delivers an optimized and interactive UI that enhances plots, charts, and other diagrams. You can also create dynamic forms in your notebook, along with markdown text, to enrich your note 
  • It’s open-source and allows vendors to customize Zeppelin heavily according to use-case requirements that vary from industry to industry 
  • The choice to select a familiar backend from a variety of pre-installed ones, or to add your own custom language backend, adds to the user-friendliness, flexibility, and adaptability 
  • It supports big data tools like Hive and Spark. It also provides support for web sockets so you can share your web page by echoing the output of the browser and creating live reports 
  • Zeppelin provides a built-in job manager that keeps track of the condition or status of various notebooks 

 

What Data Science Dojo has for you 

Our Zeppelin instance serves as a web-accessible programming environment with a range of pre-installed interpreters. In our service, users can switch between different interpreters, for example processing data with Python and then visualizing it by querying with SQL. The pre-installed backends let you perform the task in the language you are used to instead of learning a new tool. 

  • A web-accessible Zeppelin environment 
  • Several pre-installed language-backends/interpreters 
  • Various tutorial notebooks containing example code to aid understanding 
  • A Job manager responsible for monitoring the status of the notebooks 
  • A Notebook Repos feature to manage your notebook repositories’ settings 
  • Ability to import notes from JSON file or URL 
  • Built-in functionality to add and modify your own customized interpreters 
  • Credential management service 

 

Our instance supports the following interpreters: 

  • Alluxio 
  • Angular 
  • Beam 
  • BigQuery 

And many others, which you can check by taking a quick peek here: Zeppelin on Market Place  

Conclusion 

Efficiently provisioning resources for processing, visualizing, and training on large data was one area of concern when working in traditional desktop environments. The other was the burden of working with unfamiliar backends or switching among different environments. With our Zeppelin instance, both concerns are put to rest.

When coupled with Microsoft Azure services and processing speed, it outperforms the traditional counterparts because data-intensive computations aren’t performed locally, but in the cloud. You can collaborate and share notebooks with various stakeholders within and outside the company while monitoring the status of each.

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free Zeppelin Notebook Environment dedicated specifically for Machine Learning and Data Science operations on Azure Market Place. Don’t wait to install this offer by Data Science Dojo, your ideal companion in your journey to learn data science! 

Click on the button below to head over to the Azure Marketplace and deploy Apache Zeppelin for FREE by clicking on “Get it now”.

Apache Zeppelin
Note: You’ll have to sign up to Azure, for free, if you do not have an existing account.

Alyshai Nadeem
| September 15, 2022

Be it Netflix, Amazon, or another mega-giant, their success stands on the shoulders of experts and analysts who are busy deploying machine learning through supervised, unsupervised, and reinforcement learning. 

The tremendous amount of data being generated via computers, smartphones, and other technologies can be overwhelming, especially for those who do not know what to make of it. To make the best use of data, researchers and programmers often leverage machine learning for an engaging user experience.

Of the many advanced techniques that emerge every day for data scientists, supervised, unsupervised, and reinforcement learning are leveraged most often. In this article, we will briefly explain what supervised, unsupervised, and reinforcement learning are, how they differ, and how well-known companies use each of them.

Machine Learning techniques – Image Source

Supervised learning

Supervised machine learning is used for making predictions from data. To be able to do that, we need to know what to predict, which is also known as the target variable. Datasets in which the target label is known are called labeled datasets; they are used to teach algorithms to properly categorize data or predict outcomes. Therefore, for supervised learning:

  • We need to know the target value
  • Targets are known in labeled datasets

Let’s look at an example: If we want to predict the prices of houses, supervised learning can help us predict that. For this, we will train the model using characteristics of the houses, such as the area (sq ft.), the number of bedrooms, amenities nearby, and other similar characteristics, but most importantly the variable that needs to be predicted – the price of the house.

A supervised machine learning algorithm can make predictions such as predicting the different prices of the house using the features mentioned earlier, predicting trends of future sales, and many more.

Sometimes labeled data may be easily accessible, while at other times it may prove to be costly, unavailable, or difficult to obtain, which is one of the main drawbacks of supervised learning.

Saniye Alabeyi, Senior Director Analyst at Gartner, calls supervised learning the backbone of today’s economy, stating:

“Through 2022, supervised learning will remain the type of ML utilized most by enterprise IT leaders” (Source).

Types of problems:

Supervised learning deals with two distinct kinds of problems:

  1. Classification problems
  2. Regression problems

 

Classification problem: In the case of classification problems, examples are classified into one or more classes or categories.

For example, if we are trying to predict that a student will pass or fail based on their past profile, the prediction output will be “pass/fail.” Classification problems are often resolved using algorithms such as Naïve Bayes, Support Vector Machines, Logistic Regression, and many others.

Regression problem: A problem in which the output variable is a real or continuous value is defined as a regression problem. Bringing back the student example, if we are trying to predict a student’s score based on their past profile, the prediction output will be numeric, such as a likely score of “68%.”

Predicting the prices of houses in an area is an example of a regression problem and can be solved using algorithms such as linear regression, non-linear regression, Bayesian linear regression, and many others.
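The difference between the two problem types shows up in the kind of output a model returns. The sketch below fits a one-feature linear regression with ordinary least squares and then, purely for contrast, turns the numeric prediction into a category. The training figures, the function name `fit_line`, and the price cutoff are all made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up labeled data: house area in sq ft (feature) and sale price (target)
areas = [1000, 1500, 2000, 2500]
prices = [200_000, 300_000, 400_000, 500_000]

slope, intercept = fit_line(areas, prices)

# Regression: the output is a continuous number
predicted_price = slope * 1800 + intercept

# Classification: the output is a category (here via a hypothetical price cutoff)
category = "expensive" if predicted_price > 350_000 else "affordable"

print(round(predicted_price), category)
```

With this toy data the fitted line is price = 200 × area, so an 1,800 sq ft house is predicted at 360,000. A real classifier such as logistic regression would learn the class boundary from labeled examples instead of using a hand-picked cutoff.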

Why Amazon, Netflix, and YouTube are great fans of supervised learning

Recommender systems are a notable example of supervised learning. E-commerce companies such as Amazon, streaming sites like Netflix, and social media platforms such as TikTok, Instagram, and even YouTube among many others make use of recommender systems to make appropriate recommendations to their target audience.

Unsupervised learning

Imagine receiving swathes of data with no obvious pattern in it. A dataset with no labels or target values gives no indication of what to predict. Does that mean the data is all waste? Nope! The dataset likely has many hidden patterns in it.

Unsupervised learning studies the underlying patterns and predicts the output. In simple terms, in unsupervised learning, the model is only provided with the data in which it looks for hidden or underlying patterns.

Unsupervised learning is most helpful for projects where individuals are unsure of what they are looking for in data. It is used to search for unknown similarities and differences in data to create corresponding groups.

An application of unsupervised learning is the categorization of users based on their social media activities.

Commonly used unsupervised machine learning algorithms include K-means clustering, neural networks, principal component analysis, hierarchical clustering, and many more.

Reinforcement learning

Another type of machine learning is reinforcement learning.

In reinforcement learning, algorithms learn in an environment on their own. The field has gained quite some popularity over the years and has produced a variety of learning algorithms.

Reinforcement learning is neither supervised nor unsupervised as it does not require labeled data or a training set. It relies on the ability to monitor the response to the actions of the learning agent.

Most often used in gaming, robotics, and many other fields, reinforcement learning makes use of a learning agent. A start state and an end state are involved. For the learning agent to reach the final or end state, different paths may be involved.

  • An agent may also try to manipulate its environment and may travel from one state to another
  • On success, the agent is rewarded but does not receive any reward or appreciation for failure
  • Amazon has robots picking and moving goods in warehouses because of reinforcement learning
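The reward-driven loop described in the points above can be sketched with tabular Q-learning, one of the algorithms named in the comparison table of this article. The environment here is a made-up five-cell corridor: the agent starts in cell 0, the end state is cell 4, and a reward of 1 is given only on reaching it. The learning rate, discount factor, exploration rate, and episode count are arbitrary choices for the example.

```python
import random

N_STATES = 5                 # cells 0..4; cell 4 is the end state
ACTIONS = ["left", "right"]

def step(state, action):
    """Move one cell along the corridor; reward 1 only on reaching the end state."""
    if action == "right":
        next_state = min(state + 1, N_STATES - 1)
    else:
        next_state = max(state - 1, 0)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a table of action values Q[state][action] by trial and error."""
    rng = random.Random(seed)
    q = [{a: 0.0 for a in ACTIONS} for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0  # every episode begins at the start state
        while state != N_STATES - 1:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)   # explore a random action
            else:
                best = max(q[state].values())  # exploit, breaking ties at random
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward = step(state, action)
            # Q-learning update: nudge the estimate toward reward + discounted future value
            target = reward + gamma * max(q[next_state].values())
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q_table = train()
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

After training, the greedy policy read from the table should choose "right" in every cell, since that is the shortest path to the rewarded end state. No labeled examples are involved at any point; the agent only observes the reward that follows its own actions.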

Numerous IT companies including Google, IBM, Sony, Microsoft, and many others have established research centers focused on projects related to reinforcement learning.

Social media platforms like Facebook have also started implementing reinforcement learning models that can consider different inputs such as languages, integrate real-world variables such as fairness, privacy, and security, and more to mimic human behavior and interactions. (Source)

Amazon also employs reinforcement learning to teach robots in its warehouses and factories how to pick up and move goods.

Comparison between supervised, unsupervised, and reinforcement learning

Caption: Differences between supervised, unsupervised, and reinforcement learning algorithms

| | Supervised learning | Unsupervised learning | Reinforcement learning |
|---|---|---|---|
| Definition | Makes predictions from data | Segments and groups data | Reward-punishment system and interactive environment |
| Types of data | Labelled data | Unlabeled data | Acts according to a policy with a final goal to reach (no or predefined data) |
| Commercial value | High commercial and business value | Medium commercial and business value | Little commercial use yet |
| Types of problems | Regression and classification | Association and clustering | Exploitation or exploration |
| Supervision | Extra supervision | No supervision | No supervision |
| Algorithms | Linear Regression, Logistic Regression, SVM, KNN and so forth | K-Means clustering, C-Means, Apriori | Q-Learning, SARSA |
| Aim | Calculate outcomes | Discover underlying patterns | Learn a series of actions |
| Application | Risk Evaluation, Forecast Sales | Recommendation System, Anomaly Detection | Self-Driving Cars, Gaming, Healthcare |

Which is the better Machine Learning technique?

We learned about the three main members of the machine learning family essential for deep learning. Other kinds of learning are also available such as semi-supervised learning, or self-supervised learning.

Supervised, unsupervised, and reinforcement learning are all used to complete diverse kinds of tasks. No single algorithm exists that can solve every problem, as problems of different natures require different approaches to resolve them.

Despite the many differences between the three types of learning, all of these can be used to build efficient and high-value machine learning and Artificial Intelligence applications. All techniques are used in different areas of research and development to help solve complex tasks and resolve challenges.

Was this article helpful? Let us know in the comments below.

If you would like to learn more about data science, machine learning, and artificial intelligence, visit the Data Science Dojo blog.

10 interesting machine learning conferences in Asia you should attend
Alyshai Nadeem
| August 26, 2022

Confused about which machine learning conferences you should attend? Here are our top 10 picks for the remaining months of 2022.

For aspiring data scientists, machine learners, and researchers, conferences are a great way to network, highlight their own work, and learn from others. This article highlights the top 10 machine learning conferences that you should attend if you are in Asia or are planning to visit soon.

1. ACAIT 2022: The 6th Asian Conference on Artificial Intelligence Technology – Changzhou, China

Taking place in the southern Jiangsu province of China, on the 4th of November, the ACAIT is a joint endeavor of the Institute of Electrical and Electronics Engineers (IEEE), Chinese Association for Artificial Intelligence (CAAI), and Changzhou Institute of Technology (CIT).

The conference invites significant and original research work from the world of artificial intelligence. The main aim of the conference is to provide an international forum for researchers to share their ideas and achievements in the field of artificial intelligence.

The conference covers all major topics from AI-related brain and cognitive sciences to machine Cognition and Pattern Recognition, Big data and knowledge engineering, Robotics, swarm intelligence, and even the Internet of Things.

Further details regarding the conference can be found here.

2. 14th Asian Conference on Machine Learning (ACML 2022) – Hyderabad, India

Taking place from 12th to 14th December in Hyderabad, India, the ACML abides by the post-pandemic laws and will be conducted virtually, as well as allow in-person interactions.

Focusing on theoretical and practical aspects of machine learning, the conference encourages researchers from around the globe to join and be a part of the conversation.

The conference will cover general machine learning topics such as supervised learning and reinforcement learning, and even dive deeper into Deep Learning, Probabilistic Methods, theoretical frameworks, and much more.

Further details regarding the conference can be found here.

3. The 29th International Conference on Computational Linguistics – Gyeongju, Republic of Korea

One of the most popular conferences on natural language processing and computational linguistics, COLING is expected to be held on October 12-17, 2022, in Gyeongju, South Korea.

The conference has been held roughly every two years since 1965. Participants from both top-ranked research centers and emerging countries attend this conference, as it provides equal opportunities to researchers from educational institutes and academia, as well as from the corporate sector.

COLING focuses on all aspects of natural language processing and computation.

Not only is this one of the most prestigious conferences on NLP and computational linguistics, but it is also heavily sponsored by names such as LG Electronics, Hyundai, Google, and Apple, among many others.

Further details regarding the conference can be found here.

4. IROS 2022: International Conference on Intelligent Robots and Systems – Kyoto, Japan

One of the flagship conferences of the robotics community, IROS is one of the world’s oldest forums for the global robotics community to explore intelligent robots and systems. Held every year since 1987, the conference will take place this year in Kyoto, Japan, on 23-27 October.

Not only does the conference feature numerous research works from various international authors, but the conference also includes workshops and training, as well as multiple guest lectures by professionals in academia and industry.

Further details regarding the conference can be found here.

5. ACCV 2022: The 16th Asian Conference on Computer Vision

The Asian Conference on Computer Vision (ACCV) 2022 focuses on computer vision and pattern recognition and will be held on 4-8 December in Macau, China.

The biennial international conference is sponsored by the Asian Federation of Computer Vision and provides like-minded individuals an opportunity to discuss the latest problems, solutions, and technologies in the field of computer vision and other similar areas.

The conference proceedings are published by Springer as Lecture Notes. Moreover, the award-winning papers are invited for publication in a special issue of the International Journal of Computer Vision (IJCV).

More details on the conference can be found here.

6. The 29th International Conference on Neural Information Processing (ICONIP 2022), New Delhi, India

One of the leading international conferences in the fields of pattern recognition, neuroscience, intelligent control, information security, and brain-machine interface, the ICONIP will be held in New Delhi, India on 22nd -26th November 2022.

It is the annual flagship conference organized by the Asia Pacific Neural Network Society (APNNS), which strives towards bridging the gap between educational institutions and industry.

The conference provides an international forum for anyone working in neuroscience, neural networks, deep learning, and other similar fields.

The conference is divided into four categories: Theory and Algorithms, Computational and Cognitive Neurosciences, Human-Centered Computing, and other machine learning applications.

Further details on the conference can be found here.

7. The 19th Pacific Rim International Conference on Artificial Intelligence (PRICAI) – Shanghai, China

A biennial international conference, the PRICAI focuses on AI theories, technologies, and their applications in areas of social and economic importance, specifically focusing on countries in the Pacific Rim. Held since 1990, PRICAI will take place on 10-13th November, in the financial hub of China – Shanghai.

The conference focuses on all things related to AI, machine learning, data mining, robotics, computer vision, and much more.

Further information regarding the conference can be found here.

8. The 4th International Conference on Data-driven Optimization of Complex Systems (DOCS2022) – Chengdu, China

Focused on data-driven optimization, learning and control, and their applications to complex systems, DOCS 2022 will be held on 23-25th September in Chengdu, Sichuan, China.

The conference focuses on topics ranging from data-driven machine learning, optimization, decision-making, analysis, and application.

Further details on the conference can be found here.

9. The 9th IEEE International Conference on Data Science and Advanced Analytics (DSAA) – Shenzhen, China

Widely recognized as a dedicated flagship annual conference, the International Conference on Data Science and Advanced Analytics (DSAA) will be held in Shenzhen, China on the 13th –16th of October 2022.

The conference not only focuses on computing and information/intelligence sciences but also considers their relationship with statistics, and the crossover of data science and analytics.

An interesting aspect of this conference is that it is a dual-track conference with both a research track and an application track. Further details regarding these different tracks can be found here.

More details on the conference can be found here.

10. The 5th International Conference on Intelligent Autonomous Systems (ICoIAS 2022) – Dalian, China

The ICoIAS conference focuses on intelligent autonomous systems that play a significant role in multiple control and engineering applications.

The conference will be held on 23-25 September at the Dalian Maritime University, Dalian, China, in collaboration with Tianjin University, the IEEE Computational Intelligence Society, and The Institution of Engineers, Singapore.

The conference focuses on distinct aspects of intelligent autonomous systems. Moreover, IEEE fellows from all over the world are expected to attend the conference as guest speakers.

For further information regarding the conference, click here.

 

Was this list helpful? Let us know in the comments below. If you would like to find similar conferences in a different area, click here.

If you are interested in learning more about machine learning and data science, click here.

Complete the tutorial to revisit and master the fundamentals of decision tree classification models, one of the simplest and easiest models to explain.

Introduction

Data Scientists use machine learning techniques to make predictions under a variety of scenarios. Machine learning can be used to predict whether a borrower will default on their mortgage or not, or what the median house value might be in a given zip code area. Depending upon whether the prediction is being made for a quantitative variable or a qualitative variable, a predictive model can be categorized as a regression model (e.g. predicting median house values) or a classification model (e.g. predicting loan defaults).

Decision trees happen to be one of the simplest and easiest classification models to explain and, as many argue, closely resemble human decision making.

This tutorial has been developed to help you revisit and master the fundamentals of decision tree classification models which are expanded on in Data Science Dojo’s data science bootcamp and online data science certificate program. Our key focus will be to discuss the:

  1. Fundamental concepts such as data partitioning, recursive binary splitting, and nodes
  2. Data exploration and data preparation for building classification models
  3. Performance metrics for decision tree models – Gini Index, Entropy, and Classification Error.

The content builds your classification model knowledge and skills in an intuitive and gradual manner.


The scenario

You are a Data Scientist working at the Centers for Disease Control (CDC) Division for Heart Disease and Stroke Prevention. Your division has recently completed a research study to collect health examination data among 303 patients who presented with chest pain and might have been suffering from heart disease.

The Chief Data Scientist of your division has asked you to analyze this data and build a predictive model that can accurately predict patients’ heart disease status, identifying the most important predictors of heart failure. Once your predictive model is ready, you will make a presentation to the doctors working at the health facilities where the research was conducted.

The data set has 14 attributes including patients’ age, gender, blood pressure, cholesterol level, and heart disease status indicating whether the diagnosed patient was found to have heart disease or not. You have already learned that to predict quantitative attributes such as “blood pressure” or “cholesterol level”, regression models are used but to predict a qualitative attribute such as the “status of heart disease”, classification models are used.

Classification models can be built using different techniques such as Logistic Regression, Discriminant Analysis, K-Nearest Neighbors (KNN), Decision Trees, etc. Decision Trees are very easy to explain and can easily handle qualitative predictors without the need to create dummy variables.

Although decision trees generally do not have the same level of predictive accuracy as the K-Nearest Neighbor or Discriminant Analysis techniques, they serve as building blocks for other sophisticated classification techniques such as “Random Forest”, which makes mastering Decision Trees necessary!

We will now build decision trees to predict the status of heart disease i.e. to predict whether the patient has heart disease or not, and we will learn and explore the following topics along the way:

  • Data preparation for decision tree models
  • Classification trees using “rpart” package
  • Pruning the decision trees
  • Evaluating decision tree models

## You will need following libraries for this exercise 
library(dplyr) 
library(tidyverse)
library(ggplot2)
library(rpart)
library(rpart.plot)
library(rattle)
library(RColorBrewer)

## The following code suppresses warnings for the rest of the session (package start-up messages are not affected)
options(warn = -1) 

The data

You will be working with the Heart Disease Data Set which is available at UC Irvine’s Machine Learning Repository. You are encouraged to visit the repository and go through the data description. As you will find, the data folder has multiple data files available. You will use the processed.cleveland.data.

Let’s read the datafile into a dataframe “cardio”

## Reading the data into "cardio" data frame
cardio <- read.csv("processed.cleveland.data", header = FALSE, na.strings = '?')            
## Let's look at the first few rows in the cardio data frame  
head(cardio)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
63 1 1 145 233 1 2 150 0 2.3 3 0 6 0
67 1 4 160 286 0 2 108 1 1.5 2 3 3 2
67 1 4 120 229 0 2 129 1 2.6 2 2 7 1
37 1 3 130 250 0 0 187 0 3.5 3 0 3 0
41 0 2 130 204 0 2 172 0 1.4 1 0 3 0
56 1 2 120 236 0 0 178 0 0.8 1 0 3 0

As you can see, this dataframe doesn’t have column names. However, we can refer to the data dictionary, given below, and add the column names:

Column Position | Attribute Name | Description | Attribute Type
#1 | Age | Age of Patient | Quantitative
#2 | Sex | Gender of Patient | Qualitative
#3 | CP | Type of Chest Pain (1: Typical Angina, 2: Atypical Angina, 3: Non-anginal Pain, 4: Asymptomatic) | Qualitative
#4 | Trestbps | Resting Blood Pressure (in mm Hg on admission) | Quantitative
#5 | Chol | Serum Cholesterol in mg/dl | Quantitative
#6 | FBS | Fasting Blood Sugar > 120 mg/dl (1 = true; 0 = false) | Qualitative
#7 | Restecg | Resting ECG results (0 = normal; 1 and 2 = abnormal) | Qualitative
#8 | Thalach | Maximum Heart Rate Achieved | Quantitative
#9 | Exang | Exercise Induced Angina (1 = yes; 0 = no) | Qualitative
#10 | Oldpeak | ST Depression Induced by Exercise Relative to Rest | Quantitative
#11 | Slope | The slope of peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping) | Qualitative
#12 | CA | Number of major vessels (0-3) colored by fluoroscopy | Qualitative
#13 | Thal | Thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect) | Qualitative
#14 | NUM | Angiographic disease status (0 = no heart disease; more than 0 = heart disease) | Qualitative

The following code chunk will add column names to your dataframe:

## Adding column names to dataframe 
names(cardio) <- c( "age", "sex", "cp", "trestbps", "chol","fbs", "restecg", 
                           "thalach","exang", "oldpeak","slope", "ca", "thal", "status")

You are going to build a decision tree model to predict the values of variable #14, status, the “angiographic disease status” which labels or classifies each patient as “having heart disease” or “not having heart disease”.

Intuitively, we expect some of the other 13 variables to help us predict the values of status. In other words, we expect variables #1 to #13 to segment the patients, or create partitions in the cardio data frame, in such a manner that any given partition (or segment) contains mostly patients “having heart disease” or mostly patients “not having heart disease”.


Data preparation for decision trees

It is time to get familiar with the data. Let’s begin with data types.

## We will use str() function  
str(cardio)
'data.frame':	303 obs. of  14 variables:
 $ age      : num  63 67 67 37 41 56 62 57 63 53 ...
 $ sex      : num  1 1 1 1 0 1 0 0 1 1 ...
 $ cp       : num  1 4 4 3 2 2 4 4 4 4 ...
 $ trestbps : num  145 160 120 130 130 120 140 120 130 140 ...
 $ chol     : num  233 286 229 250 204 236 268 354 254 203 ...
 $ fbs      : num  1 0 0 0 0 0 0 0 0 1 ...
 $ restecg  : num  2 2 2 0 2 0 2 0 2 2 ...
 $ thalach  : num  150 108 129 187 172 178 160 163 147 155 ...
 $ exang    : num  0 1 1 0 0 0 0 1 0 1 ...
 $ oldpeak  : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
 $ slope    : num  3 2 2 3 1 1 3 1 2 3 ...
 $ ca       : num  0 3 2 0 0 0 2 0 1 0 ...
 $ thal     : num  6 3 7 3 3 3 3 3 7 7 ...
 $ status   : int  0 2 1 0 0 0 3 0 2 1 ...

As you can see, some qualitative variables in our data frame are stored as quantitative variables:

  • status is declared as int (integer), which makes it a quantitative variable, but we know disease status must be qualitative
  • You can see that sex, cp, fbs, restecg, exang, slope, ca and thal must be qualitative too

The next code chunk will convert and correct the data types:

## We can use lapply to convert data types across multiple columns
factor.cols <- c("sex", "cp", "fbs", "restecg", "exang",
                 "slope", "ca", "thal", "status")
cardio[factor.cols] <- lapply(cardio[factor.cols], factor)
## You can verify the data frame 
str(cardio)
'data.frame':	303 obs. of  14 variables:
 $ age     : num  63 67 67 37 41 56 62 57 63 53 ...
 $ sex     : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 1 1 2 2 ...
 $ cp      : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
 $ trestbps: num  145 160 120 130 130 120 140 120 130 140 ...
 $ chol    : num  233 286 229 250 204 236 268 354 254 203 ...
 $ fbs     : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
 $ restecg : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
 $ thalach : num  150 108 129 187 172 178 160 163 147 155 ...
 $ exang   : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
 $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
 $ slope   : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
 $ ca      : Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 1 3 1 2 1 ...
 $ thal    : Factor w/ 3 levels "3","6","7": 2 1 3 1 1 1 1 1 3 3 ...
 $ status  : Factor w/ 5 levels "0","1","2","3",..: 1 3 2 1 1 1 4 1 3 2 ...

Also note that status has 5 different values, viz. 0, 1, 2, 3, 4. While status = 0 indicates no heart disease, all other values of status indicate heart disease. In this exercise, you are building a decision tree model to classify each patient as “normal” (not having heart disease) or “abnormal” (having heart disease).

Therefore, you can merge status = 1, 2, 3, and 4 into a single level status = “1”. This way you will convert status into a binary or dichotomous variable having only two values: status = “0” (normal) and status = “1” (abnormal).

Let’s do that!

##  We will use the 'forcats' package included in the 'tidyverse' package
##  The function to be used will be fct_collapse
cardio$status <- fct_collapse(cardio$status, "1" = c("1","2", "3", "4"))  


## Let's also change the labels under the "status" from (0,1) to (normal, abnormal)  
levels(cardio$status) <- c("normal", "abnormal")  

## levels under sex can also be changed to (female, male)   
## We can change level names in other categorical variables as well but we are not doing that  
levels(cardio$sex) <- c("female", "male")  

So, you have corrected the data types. What’s next?

How about getting a summary for all the variables in the data?

## Overall summary of all the columns 
summary(cardio)
      age            sex      cp         trestbps          chol       fbs    
 Min.   :29.00   female: 97   1: 23   Min.   : 94.0   Min.   :126.0   0:258  
 1st Qu.:48.00   male  :206   2: 50   1st Qu.:120.0   1st Qu.:211.0   1: 45  
 Median :56.00                3: 86   Median :130.0   Median :241.0          
 Mean   :54.44                4:144   Mean   :131.7   Mean   :246.7          
 3rd Qu.:61.00                        3rd Qu.:140.0   3rd Qu.:275.0          
 Max.   :77.00                        Max.   :200.0   Max.   :564.0

 restecg    thalach      exang      oldpeak     slope      ca        thal    
 0:151   Min.   : 71.0   0:204   Min.   :0.00   1:142   0   :176   3   :166  
 1:  4   1st Qu.:133.5   1: 99   1st Qu.:0.00   2:140   1   : 65   6   : 18  
 2:148   Median :153.0           Median :0.80   3: 21   2   : 38   7   :117  
         Mean   :149.6           Mean   :1.04           3   : 20   NA's:  2  
         3rd Qu.:166.0           3rd Qu.:1.60           NA's:  4             
         Max.   :202.0           Max.   :6.20                                

       status   
 normal  :164  
 abnormal:139  


Did you notice the missing values (NA’s) under the ca and thal columns? With the following code, you can count the missing values across all the columns in your data frame.

# Counting the missing values in the data frame
sum(is.na(cardio))
6

Only 6 values are missing across 303 rows, which is approximately 2%. That is a very low proportion of missing values. What do you want to do with these missing values before you start building your decision tree model?

  • Option1: discard the missing values before training.
  • Option2: rely on the machine learning algorithm to deal with missing values during the model training.
  • Option3: impute missing values before training.

For most learning methods, Option 3, the imputation approach, is necessary. The simplest approach is to impute the missing values with the mean or median of the non-missing values for the given feature.
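As a minimal sketch of Option 3, assume a small made-up data frame (toy, not the cardio data) and a hypothetical helper impute_simple(). Numeric columns are filled with the median; factor columns, for which a mean or median is not defined, are filled with their most frequent level.

```r
## Toy data (made up for illustration; not the cardio data)
toy <- data.frame(
  chol = c(233, NA, 229, 250, 204),
  thal = factor(c("3", "7", NA, "3", "3"))
)

## Hypothetical helper: median for numeric columns, most frequent level for factors
impute_simple <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next
    if (is.numeric(df[[col]])) {
      df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
    } else if (is.factor(df[[col]])) {
      mode.level <- names(which.max(table(df[[col]])))
      df[[col]][missing] <- mode.level
    }
  }
  df
}

toy.imputed <- impute_simple(toy)

## No missing values should remain
sum(is.na(toy.imputed))
```

This tutorial does not use imputation for the cardio data; the sketch only illustrates what the option involves.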

The choice of Option 2 depends on the learning algorithm. Learning algorithms such as CART (implemented in rpart) simply ignore missing values when determining the quality of a split. To determine whether a case with a missing value for the best split is to be sent left or right, the algorithm uses surrogate splits. You may want to read more on this here.
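To see Option 2 in action, here is a small sketch that uses the kyphosis data set bundled with rpart as a stand-in (an assumption made so the example runs on its own) and injects a few missing values into one predictor. With its default settings, rpart() keeps those rows when fitting.

```r
library(rpart)

## kyphosis ships with rpart; missing values are injected purely for illustration
kyph <- kyphosis
set.seed(1)
kyph$Start[sample(nrow(kyph), 8)] <- NA

## The default na.action keeps rows that have a response and at least one predictor
fit.na <- rpart(Kyphosis ~ Age + Number + Start, data = kyph, method = "class")

## The number of observations at the root equals the number of rows supplied
fit.na$frame$n[1]
nrow(kyph)
```

If the two numbers agree, no row was dropped because of the missing predictor values.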

However, if the relative amount of missing data is small, you can go for Option 1 and discard the missing values, as long as doing so does not create or worsen a class imbalance, which is briefly discussed in the following section.

As for your data set, you are safe to delete missing value cases. The following code-chunk does that for you.

## Removing missing values  
cardio <- na.omit(cardio)

Data exploration

Status is the variable that you want to predict with your model. As we have discussed earlier, other variables in the cardio dataset should help you predict status.

For example, amongst patients with heart disease, you might expect the average cholesterol level (chol) to be higher than amongst those who are normal. Likewise, amongst patients with high blood sugar (fbs = 1), you might expect the proportion of patients with heart disease to be higher than amongst patients whose blood sugar is not elevated (fbs = 0). You can do some data visualization and exploration to check.

You may want to start with a distribution of status. Following code-chunk will provide you that:

## plotting a bar chart for status (geom_bar counts the rows in each level)
cardio %>%
  ggplot(aes(x = status)) +
  geom_bar(fill = "steelblue") +
  theme_bw()

From this plot, you can observe that there is an almost equal split between patients with normal and abnormal status.

This may not always be the case. There might be datasets in which one of the classes in the predicted variable has a very low proportion. Such datasets are said to have a class imbalance problem where one of the classes in the predicted variable is rare within the dataset.
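A quick numeric check of class balance can complement the plot. The sketch below uses the class counts reported by summary(cardio) earlier (before the rows with missing values were removed); the variable names are illustrative.

```r
## Class counts as reported by summary(cardio) earlier in this tutorial
status.counts <- c(normal = 164, abnormal = 139)

## Share of each class; for a factor column, prop.table(table(x)) gives the same shares
status.share <- round(status.counts / sum(status.counts), 3)
status.share
```

Shares close to one half for both classes indicate that this data set does not suffer from a class imbalance problem.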

Credit Card Fraud Detection Model or a Mortgage Loan Default Model are some examples of classification models that are built with a dataset having a class imbalance problem. What other scenarios come to your mind?

You are encouraged to read this article: ROSE: A Package for Binary Imbalanced Learning

You should now explore the distribution of the quantitative variables. You can make density plots with frequency counts on the Y-axis and split each plot by the two levels of the status variable.

The following code will produce the plots arranged in a grid of 2 rows:

## frequency plots for quantitative variables, split by status
## after_stat(count) replaces the deprecated ..count.. notation (needs ggplot2 >= 3.3.0)
cardio %>%
  gather(-sex, -cp, -fbs, -restecg, -exang, -slope, -ca, -thal, -status, key = "var", value = "value") %>%
  ggplot(aes(x = value, y = after_stat(count), colour = status)) +
  scale_color_manual(values = c("#008000", "#FF0000")) +
  geom_density() +
  facet_wrap(~var, scales = "free", nrow = 2) +
  theme_bw()

What are your observations from the quantitative plots? Some of your observations might be:

  • In all the plots, as we move along the X-axis, the abnormal curve mostly, but not always, lies below the normal curve. You should expect this, as the total number of patients under abnormal is smaller. However, for some values on the X-axis (which could be smaller or larger values of X, depending upon the predictor), the abnormal curve lies above.
  • For example, look at the age plot. Until x = 55 years, the majority of patients fall under the normal curve. Once x > 55 years, the majority shifts to abnormal patients and remains so until x = 68 years. Intuitively, age could be a good predictor of status, and you may want to partition the data at x = 55 years and then again at x = 68 years. When you build your decision tree model, you may expect internal nodes with x > 55 years and x > 68 years.
  • Next, observe the plot for chol. Except for a narrow range (x = 275 mg/dl to x = 300 mg/dl), the normal curve always lies above the abnormal curve. You may want to form a hypothesis that cholesterol is not a good predictor of status. In other words, you may not expect chol to be amongst the earliest internal nodes in your decision tree model.

Likewise, you can make hypotheses for other quantitative variables as well. Of course, your decision tree model will help you validate your hypothesis.

Now you may want to turn your attention to qualitative variables.

## frequency plots for qualitative variables, split by status
cardio %>%
  gather(-age, -trestbps, -chol, -thalach, -oldpeak, -status, key = "var", value = "value") %>%
  ggplot(aes(x = value, color = status)) +
  scale_color_manual(values = c("#008000", "#FF0000")) +
  geom_bar(fill = "white") +
  facet_wrap(~var, scales = "free", nrow = 3) +
  theme_bw()

What are your observations from the qualitative plots? How do you want to partition data along the qualitative variables?

  • Observe the cp or chest pain plot. The presence of asymptomatic chest pain, indicated by cp = 4, could provide a partition in the data and could be among the earliest nodes in your decision tree.
  • Likewise, observe the sex plot. Clearly, the proportion of abnormal is much lower among females (approximately 25%) than among males (approximately 50%). Intuitively, sex might also be a good predictor, and you may want to partition the patients’ data along sex. When you build your decision tree model, you may expect internal nodes with sex.

At this point, you may want to go back to both plots and list down the partition (variables and, more importantly, variable values) that you expect to find in your decision tree model.

Of course, all our hypotheses will get validated once we build our decision tree model.


Partitioning data: Training and test sets

Before you start building your decision tree, split the cardio data into a training set and test set:

cardio.train: 70% of the dataset

cardio.test: 30% of the dataset

Following code-chunk will do that:

## Now you can randomly split your data into a 70% training set and a 30% test set
## You should set the seed to ensure that you get the same training vs. test split every time you run the code
set.seed(1) 

## randomly extract row numbers in cardio dataset which will be included in the training set  
train.index <- sample(1:nrow(cardio), round(0.70*nrow(cardio),0))

## subset cardio data set to include only the rows in train.index to get cardio.train  
cardio.train <- cardio[train.index, ]

## subset cardio data set to include only the rows NOT in train.index to get cardio.test  
## Did you note the negative sign?
cardio.test <- cardio[-train.index,  ]

Classification trees using rpart

 

“rpart” Package

You will now use rpart package to build your decision tree model. The decision tree that you will build, can be plotted using packages rpart.plot or rattle which provides better-looking plots.

You will use function rpart() to build your decision tree model. The function has the following key arguments:

formula: rpart(formula = < >, …)

The formula is where you declare which predictors you are using in your decision tree. You can specify status ~ . to indicate that you want to use all the predictors in your decision tree.

method: rpart(method = < >, …)

The same function can be used to build a classification tree as well as a regression tree. You can use “class” to specify that you are using the rpart() function for building a classification tree. If you were building a regression tree, you would specify “anova” instead.

cp: rpart(cp = < >, …)

The main role of the Complexity Parameter (cp) is to control the size of the decision tree. Any split that does not decrease the overall lack of fit by a factor of cp is not attempted. The default value is 0.01. A value of cp = 1 will result in a tree with no splits. Setting cp to negative values ensures a fully grown tree.

minsplit: rpart(minsplit = < >, …)

The minimum number of observations that must exist in a node in order for a split to be attempted. The default value is 20.

minbucket: rpart(minbucket = < >, …)

The minimum number of observations in any terminal node. If only one of minbucket or minsplit is specified, the code either sets minsplit to minbucket*3 or minbucket to minsplit/3, which is the default.
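The arguments above can also be bundled with rpart.control() and passed through the control argument. The sketch below uses the kyphosis data set that ships with rpart as a stand-in (an assumption so that the example runs on its own) and spells out the default values explicitly.

```r
library(rpart)

## Spelling out the defaults: cp = 0.01, minsplit = 20, minbucket = round(20 / 3) = 7
ctrl <- rpart.control(cp = 0.01, minsplit = 20, minbucket = 7)

## kyphosis is a small data set bundled with rpart (used here only as a stand-in)
fit.ctrl <- rpart(Kyphosis ~ Age + Number + Start,
                  data = kyphosis, method = "class", control = ctrl)

## Every terminal node should hold at least minbucket observations
leaf.sizes <- fit.ctrl$frame$n[fit.ctrl$frame$var == "<leaf>"]
min(leaf.sizes) >= 7
```

Raising minsplit or minbucket generally gives a smaller tree, and lowering them allows a larger one, in much the same way that cp does.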

You are encouraged to read the rpart package documentation.

You can build a decision tree using all the predictors and with a cp = 0.05. The following code chunk will build your decision tree model:

## using all the predictors and setting cp = 0.05 
cardio.train.fit <- rpart(status ~ . , data = cardio.train, method = "class", cp = 0.05)

It is time to plot your decision tree. You can use the function rpart.plot() for plotting your tree. However, the function fancyRpartPlot() in the rattle package is more ‘fancy’:

## Using fancyRpartPlot() from "rattle" package
fancyRpartPlot(cardio.train.fit, palettes = c("Greens", "Reds"), sub = "")

Interpreting decision tree plot

What are your observations from your decision tree plot?

Each square box is a node of one or the other type (discussed below):

Root Node cp = 1, 2, 3: The root node represents the entire population or 100% of the sample.

Decision Nodes thal = 3 and ca = 0: These are the two internal nodes below the root; each splits further, either into internal nodes or into terminal nodes. Counting the root, there are 3 decision nodes here.

Terminal Nodes (Leaf): The nodes that do not split further, are called terminal nodes or leaves. Your decision tree has 4 terminal nodes.

The decision tree plot gives the following information:

Predictors Used in Model: Only the thal, cp, and ca variables are included in this decision tree.

Predicted Probabilities: Predicted probability of a patient being normal or abnormal. Note that the two probabilities add to 100%, at each node.

Node Purities: Each node has two proportions written left and right. The leftmost leaf has 0.82 and 0.18. The number on left, 0.82 tells you what proportion of the node actually belongs to the predicted class. You can see that this leaf has 82% purity.

Sample Proportion: Each node has a proportion of the sample. The proportion is 100% for the root node. The percentages under the split nodes add up to give the percentage in their parent node.

Predicted class: Each node shows the predicted class as normal or abnormal. It is the most commonly occurring class in that node, but the node might still include observations belonging to the other class as well. This forms the concept of node impurity.


Fully grown decision tree

Is this the fully-grown decision tree?

No! Recall that you have grown the decision tree with cp = 0.05, which ensured that your decision tree doesn’t include any split that does not decrease the overall lack of fit by a factor of 0.05 (5%).

However, if you change this parameter, you might get a different decision tree. Run the following code-chunk to get the plot of a fully grown decision tree, with a cp = 0

## using all the predictors and setting cp = 0 to grow the full tree
cardioFull <- rpart(status ~ . , data = cardio.train, method = "class", cp = 0)

## Using fancyRpartPlot() from "rattle" package
fancyRpartPlot(cardioFull, palettes = c("Greens", "Reds"),sub = "")

The fully grown tree adds two more predictors, thalach and oldpeak, to the tree that you built earlier. You have now seen that changing the cp parameter gives decision trees of different sizes, with more nodes and/or more leaves. At this stage, you might want to ask the following questions:

  • Which of the two decision trees should you go ahead with and present to your division’s Chief Data Scientist? The one developed with cp = 0.05 or the one with cp = 0?
  • Does a bigger decision tree present a better classification model or a worse one?
  • Is the default value of cp = 0.01 the best possible?
  • How would you select a cp value that ensures the best-performing decision tree model?

There are no rules of thumb for how large or small a decision tree should grow. However, you should be aware that:

  • a large tree might overfit the data and thus might lead to a model with high variance
  • a small tree might miss important predictors and thus might lead to a model with high bias

So, which of the two decision trees should you present to your division’s Chief Data Scientist? What are the parameters that you can control to build your best decision tree? What are the metrics that you can use to evaluate and justify the performance of your decision tree model?


Pruning the decision trees

Optimal tree size is chosen adaptively from the training data. The recommended approach is to build a fully-grown decision tree and then extract a nested sub-tree (prune it) in a way that you are left with a tree that has minimal node impurities.

As you have learned in your in-class module, there are three different metrics to calculate the node impurities that can be used for a given node m:

Gini Index:

A measure of total variance across all the classes in the predicted variable. A smaller value of G indicates a purer or more homogeneous node.

Gini Index

Here, Pmk gives the proportion of training observations in the mth region that are from the kth class.

Cross-Entropy or Deviance:

Another measure of node impurity:

Cross-Entropy or Deviance

As with the Gini-index, the mth node is purer if the entropy D is smaller.

In your fitted decision tree model, there are two classes in the predicted variable, therefore K = 2, and there are m = 5 regions.

Misclassification Error:

The fraction of the training observations in the mth node that do not belong to the most common class:

Misclassification Error
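For reference, the three impurity measures named above are usually written as follows. These are the standard textbook definitions, given here in the notation of the text: $\hat{p}_{mk}$ is the proportion of training observations in the mth node that are from the kth class, and K is the number of classes.

```latex
\begin{align*}
G_m &= \sum_{k=1}^{K} \hat{p}_{mk}\,\bigl(1 - \hat{p}_{mk}\bigr)
    && \text{(Gini index)} \\
D_m &= -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}
    && \text{(cross-entropy or deviance)} \\
E_m &= 1 - \max_{k} \hat{p}_{mk}
    && \text{(misclassification error)}
\end{align*}
```

All three are zero for a perfectly pure node and largest when the classes are equally mixed.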

When growing a decision tree, Gini Index or Entropy are typically used to evaluate the quality of the split.

However, for pruning the tree, Misclassification Error is used.
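To make the three measures concrete, the sketch below computes them for a single node from its class counts. The helper node_impurity() is hypothetical (it is not part of rpart), and the counts, 89 normal and 20 abnormal, are those reported for node 2 in the summary output later in this tutorial.

```r
## Class counts of one node (node 2 in the summary output later in this tutorial)
node.counts <- c(normal = 89, abnormal = 20)

## Hypothetical helper returning the three impurity measures for one node
node_impurity <- function(counts) {
  p <- counts / sum(counts)
  p.pos <- p[p > 0]                      # drop empty classes to avoid log(0)
  c(gini     = sum(p * (1 - p)),
    entropy  = -sum(p.pos * log(p.pos)),
    misclass = 1 - max(p))
}

round(node_impurity(node.counts), 3)
```

The misclass value is the same quantity that summary() reports as the expected loss of a node.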

You can now get back to the fully grown decision tree that you built with cp = 0.

The Complexity Parameter Table will help you evaluate the fitted decision tree model. For your decision tree cardioFull, you can print the complexity parameter table using printcp() and plot it using plotcp().

The CP table will help you select the decision tree that minimizes the misclassification error. CP table lists down all the trees nested within the fitted tree. The best-nested sub-tree can then be extracted by selecting the corresponding value for cp.

The following code will print the CP table for you:

## printing the CP table for the fully-grown tree 
printcp(cardioFull)
Classification tree:
rpart(formula = status ~ ., data = cardio.train, method = "class", 
    cp = 0)

Variables actually used in tree construction:
[1] ca      cp      oldpeak thal    thalach

Root node error: 95/208 = 0.45673

n= 208 

        CP nsplit rel error  xerror     xstd
1 0.536842      0   1.00000 1.00000 0.075622
2 0.063158      1   0.46316 0.52632 0.064872
3 0.031579      3   0.33684 0.38947 0.058056
4 0.015789      4   0.30526 0.35789 0.056138
5 0.000000      6   0.27368 0.36842 0.056794

The plotcp() gives a visual representation of the cross-validation results in an rpart object.

## plotting the cp 
plotcp(cardioFull, lty = 3, col = 2, upper = "splits" )

CP table

How do we interpret the cp table? What is your objective here?

Your objective is to prune the fitted tree i.e. select a nested sub-tree from this fitted tree, such that the cross-validated error or the xerror is the minimum.

The complexity table for your decision tree lists all the trees nested within the fitted tree. The table is printed from the smallest tree possible (nsplit = 0, i.e. no splits) to the largest one (nsplit = 6, six splits). The number of terminal nodes in a sub-tree is always 1 + the number of splits.

For easier reading, the error columns have been scaled so that the first node (nsplit = 0) has an error of 1. In your decision tree, the model with no splits makes 95/208 misclassifications, so you can multiply the columns rel error, xerror, and xstd by 95 to get the absolute values. In the first column, the complexity parameter has been similarly scaled. From the cp table we want to select the cp value that minimizes the cross-validated error (xerror).

CP plot

plotcp() gives a visual representation of the CP table. The Y-axis of the plot has the xerror values and the X-axis has the geometric means of the intervals of cp values for which pruning is optimal. The red horizontal line is drawn 1 SE above the minimum of the curve. A good choice of cp for pruning is typically the leftmost value for which the mean lies below the red line.
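The 1-SE rule described above can be sketched in a few lines. To keep the example self-contained, the CP table is typed in from the printcp(cardioFull) output shown earlier; with a fitted object you would read the same columns from cardioFull$cptable. The names cp.table, threshold and cp.1se are illustrative.

```r
## CP table copied from the printcp(cardioFull) output above
cp.table <- data.frame(
  CP     = c(0.536842, 0.063158, 0.031579, 0.015789, 0.000000),
  nsplit = c(0, 1, 3, 4, 6),
  xerror = c(1.00000, 0.52632, 0.38947, 0.35789, 0.36842),
  xstd   = c(0.075622, 0.064872, 0.058056, 0.056138, 0.056794)
)

## 1-SE threshold: the minimum xerror plus its standard error
best.row  <- which.min(cp.table$xerror)
threshold <- cp.table$xerror[best.row] + cp.table$xstd[best.row]

## Smallest tree (fewest splits) whose xerror does not exceed the threshold
row.1se <- min(which(cp.table$xerror <= threshold))
cp.1se  <- cp.table$CP[row.1se]
cp.1se
```

For this table the rule picks the tree with 3 splits, which is slightly smaller than the tree with the minimum xerror (4 splits). The rest of this tutorial continues with the minimum-xerror choice.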

The following code-chunk will help you select the best cp from the cp table

## selecting the best cp, corresponding to the minimum value in xerror 
bestcp <- cardioFull$cptable[which.min(cardioFull$cptable[,"xerror"]),"CP"]

## print the best cp
bestcp

0.0157894736842105

You can now use this bestcp to prune the fully-grown decision tree

## Prune the tree using the best cp.
cardio.pruned <- prune(cardioFull, cp = bestcp)
## You can now plot the pruned tree 
fancyRpartPlot(cardio.pruned, palettes = c("Greens", "Reds"), sub = "")   

You can use the summary() function to get a detailed summary of the pruned decision tree. It prints the call, the table shown by printcp, the variable importance (summing to 100), and details for each node (the details depending on the type of tree).

## printing the detailed summary of the pruned tree
summary(cardio.pruned)  
Call:
rpart(formula = status ~ ., data = cardio.train, method = "class", 
    cp = 0)
  n= 208 

          CP nsplit rel error    xerror       xstd
1 0.53684211      0 1.0000000 1.0000000 0.07562158
2 0.06315789      1 0.4631579 0.5263158 0.06487215
3 0.03157895      3 0.3368421 0.3894737 0.05805554
4 0.01578947      4 0.3052632 0.3578947 0.05613824

Variable importance
      cp     thal    exang  thalach       ca  oldpeak trestbps      age 
      28       17       14       13       12       12        3        2 
     sex 
       1 

Node number 1: 208 observations,    complexity param=0.5368421
  predicted class=normal    expected loss=0.4567308  P(node) =1
    class counts:   113    95
   probabilities: 0.543 0.457 
  left son=2 (109 obs) right son=3 (99 obs)
  Primary splits:
      cp      splits as  LLLR,      improve=34.19697, (0 missing)
      thal    splits as  LRR,       improve=31.59722, (0 missing)
      exang   splits as  LR,        improve=23.76356, (0 missing)
      ca      splits as  LRRR,      improve=21.46291, (0 missing)
      thalach < 147.5 to the right, improve=17.90570, (0 missing)
  Surrogate splits:
      exang   splits as  LR,        agree=0.731, adj=0.434, (0 split)
      thal    splits as  LRR,       agree=0.702, adj=0.374, (0 split)
      thalach < 148.5 to the right, agree=0.683, adj=0.333, (0 split)
      ca      splits as  LRRR,      agree=0.625, adj=0.212, (0 split)
      oldpeak < 0.85  to the left,  agree=0.611, adj=0.182, (0 split)

Node number 2: 109 observations,    complexity param=0.03157895
  predicted class=normal    expected loss=0.1834862  P(node) =0.5240385
    class counts:    89    20
   probabilities: 0.817 0.183 
  left son=4 (98 obs) right son=5 (11 obs)
  Primary splits:
      oldpeak < 1.95  to the left,  improve=5.018621, (0 missing)
      slope   splits as  LRL,       improve=4.913298, (0 missing)
      thal    splits as  LRR,       improve=4.888193, (0 missing)
      ca      splits as  LRRR,      improve=3.642018, (0 missing)
      thalach < 152.5 to the right, improve=3.280350, (0 missing)

Node number 3: 99 observations,    complexity param=0.06315789
  predicted class=abnormal  expected loss=0.2424242  P(node) =0.4759615
    class counts:    24    75
   probabilities: 0.242 0.758 
  left son=6 (35 obs) right son=7 (64 obs)
  Primary splits:
      thal    splits as  LRR,       improve=8.002922, (0 missing)
      exang   splits as  LR,        improve=7.972659, (0 missing)
      ca      splits as  LRRR,      improve=7.539716, (0 missing)
      oldpeak < 0.7   to the left,  improve=3.625175, (0 missing)
      thalach < 175   to the right, improve=3.354320, (0 missing)
  Surrogate splits:
      trestbps < 116   to the left,  agree=0.717, adj=0.200, (0 split)
      oldpeak  < 0.05  to the left,  agree=0.707, adj=0.171, (0 split)
      thalach  < 175   to the right, agree=0.697, adj=0.143, (0 split)
      sex      splits as  LR,        agree=0.677, adj=0.086, (0 split)
      age      < 69.5  to the right, agree=0.667, adj=0.057, (0 split)

Node number 4: 98 observations
  predicted class=normal    expected loss=0.1326531  P(node) =0.4711538
    class counts:    85    13
   probabilities: 0.867 0.133 

Node number 5: 11 observations
  predicted class=abnormal  expected loss=0.3636364  P(node) =0.05288462
    class counts:     4     7
   probabilities: 0.364 0.636 

Node number 6: 35 observations,    complexity param=0.06315789
  predicted class=normal    expected loss=0.4857143  P(node) =0.1682692
    class counts:    18    17
   probabilities: 0.514 0.486 
  left son=12 (20 obs) right son=13 (15 obs)
  Primary splits:
      ca       splits as  LRRR,      improve=7.619048, (0 missing)
      exang    splits as  LR,        improve=6.294925, (0 missing)
      trestbps < 126.5 to the right, improve=2.519048, (0 missing)
      thalach  < 170   to the right, improve=2.057143, (0 missing)
      age      < 53.5  to the left,  improve=1.866667, (0 missing)
  Surrogate splits:
      thalach  < 134   to the right, agree=0.743, adj=0.400, (0 split)
      trestbps < 129   to the right, agree=0.714, adj=0.333, (0 split)
      exang    splits as  LR,        agree=0.686, adj=0.267, (0 split)
      oldpeak  < 1.7   to the left,  agree=0.686, adj=0.267, (0 split)
      age      < 62.5  to the left,  agree=0.657, adj=0.200, (0 split)

Node number 7: 64 observations
  predicted class=abnormal  expected loss=0.09375  P(node) =0.3076923
    class counts:     6    58
   probabilities: 0.094 0.906 

Node number 12: 20 observations
  predicted class=normal    expected loss=0.2  P(node) =0.09615385
    class counts:    16     4
   probabilities: 0.800 0.200 

Node number 13: 15 observations
  predicted class=abnormal  expected loss=0.1333333  P(node) =0.07211538
    class counts:     2    13
   probabilities: 0.133 0.867 

Evaluating decision tree models

You can now use the predict function in the rpart package to predict the status of the patients included in the test data, cardio.test.

The following code chunks predict the status values for the test data and then print the confusion matrix of actual vs. predicted values:

## You can now use your pruned tree model to predict the status for your test data 
cardio.predict <- predict(cardio.pruned, cardio.test, type = "class")

You should now evaluate the performance of your model on the test data. You will use your Confusion Matrix and calculate the Classification Error in the predictions:

# confusion matrix (test data)
conf.matrix <- table(cardio.test$status, cardio.predict)
rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix), sep = ":")
colnames(conf.matrix) <- paste("Predicted", colnames(conf.matrix), sep = ":")
print(conf.matrix)
                 cardio.predict
                  Predicted:normal Predicted:abnormal
  Actual:normal                 40                  7
  Actual:abnormal               14                 28

You can calculate the classification error as:

## calculating the classification error 
round((14 + 7)/89,3)
0.236
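
Rather than typing the cell counts by hand, the same error can be computed directly from the confusion matrix. The sketch below rebuilds the matrix from the four counts printed above so that it runs on its own; in the tutorial you would use the conf.matrix object you already created. The names accuracy and class.error are illustrative.

```r
# Rebuild the confusion matrix printed above (rows = actual, columns = predicted)
conf.matrix <- matrix(c(40, 14, 7, 28), nrow = 2,
                      dimnames = list(c("Actual:normal", "Actual:abnormal"),
                                      c("Predicted:normal", "Predicted:abnormal")))

# Correct predictions sit on the diagonal; every other cell is a misclassification
accuracy <- sum(diag(conf.matrix)) / sum(conf.matrix)
class.error <- 1 - accuracy

print(round(class.error, 3))
print(round(accuracy, 3))
```

Computing the error this way keeps it in sync with the matrix if you change the seed or the split.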

So, your decision tree has 23.6% prediction error. In other words, your model has been able to classify the patients as normal or abnormal with an accuracy of 76.4%. Your division’s Chief Data Scientist should be impressed. Also, you have a classification model that you can very easily explain to doctors.

However, before we wind up, here is a small exercise for you.

Small Exercise:

Decision tree models can suffer from extremely high variance. A small change in the training data can give you very different results. This short exercise is designed to make this point. In the code-chunk given below change the values, one at a time, for the following parameters, run the code, and then observe how the decision tree model changes:

set.seed (a): Set the seed to a different number: ‘1234’ or ‘1729’ or ‘9999’ or whatever you like

Training set proportion (p): Set the proportion to different numbers: ‘70%’ or ‘80%’, ‘90%’ or whatever you like

You can go ahead and use the code till the calculation of the prediction error but even plotting the fitted tree would help!

## You should keep the original data frame intact so let's make a copy cardioplay  
cardioplay <- cardio 

## you set the seed to ensure that you get the same training v/s. test split every time you run the code
## Keeping all else constant, you should change the seed from '1234' to any other number 
a <- as.numeric(1234) 


## randomly extract row numbers in cardio dataset which will be included in the training set
## Keeping all else constant, you should change the proportion from '50%' to any other proportion 
p <- as.numeric(0.50)
## You don't need to make any changes in this code-chunk
## Make changes in the code-chunk just above and observe the changes in the output of this code-chunk  

## seed 
set.seed(a) 

## rows in training data 
trainset <- sample(1:nrow(cardioplay), round(p*nrow(cardioplay),0))
cardioplay.train <- cardioplay[trainset, ]

## rows in test data  
cardioplay.test <- cardioplay[-trainset,  ] 

## fit the tree 
cardioplay.train.fit <- rpart(status ~ . , data = cardioplay.train, method = "class") 

## plot the tree 
fancyRpartPlot(cardioplay.train.fit, palettes = c("Greens", "Reds"), sub = "")
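
To see the variance across several seeds at once, the exercise can be wrapped in a small function. The sketch below is not part of the original exercise: it uses the kyphosis data frame that ships with rpart as a stand-in for the cardio data, and split.error is a hypothetical helper name. Swap in cardioplay and status to repeat it on your own data.

```r
library(rpart)

# Fit a tree on a random split and return the test-set classification error.
# `a` is the seed and `p` the training proportion, as in the exercise.
split.error <- function(data, a, p) {
  set.seed(a)
  trainset <- sample(1:nrow(data), round(p * nrow(data), 0))
  fit <- rpart(Kyphosis ~ ., data = data[trainset, ], method = "class")
  pred <- predict(fit, data[-trainset, ], type = "class")
  # Compare as character so differing factor level sets cannot cause an error
  mean(as.character(pred) != as.character(data[-trainset, "Kyphosis"]))
}

# Same data, same proportion, three different seeds
seeds <- c(1234, 1729, 9999)
errors <- sapply(seeds, function(a) split.error(kyphosis, a, 0.5))
print(round(errors, 3))
```

If the printed errors differ noticeably from seed to seed, that is the high variance the exercise is meant to demonstrate.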


Conclusion

Now you have a good understanding of how to perform exploratory data analysis and prepare your dataset before you set out to build a decision tree. You are also familiar with various functions in the rpart package with which you can build, plot, and prune decision trees. As we discussed earlier, there are other tree-based approaches, such as Bagging, Random Forests, and Boosting, which improve accuracy.

You are all set to start practicing exercises on these advanced topics!

Phuc Duong
| March 16, 2016

In this blog on custom R models, we use the bike-sharing dataset to build a Random Forest model in Azure Machine Learning with R.

The bike-sharing dataset includes the number of bikes rented for different weather conditions. From the dataset, we can build a model that will predict how many bikes will be rented during certain weather conditions.

About Azure machine learning data

Azure Machine Learning Studio has a couple of dozen built-in machine learning algorithms. But what if you need an algorithm that is not there? What if you want to customize certain algorithms? Azure can use any R or Python-based machine learning package and associated algorithms! It’s called the “create model” module. With it, you can leverage the entire open-sourced R and Python communities.

The Bike Sharing dataset is a great data set for exploring Azure ML’s new R-script and R-model modules. The R-script module allows for easy feature engineering from date-times, and the R-model module lets us take advantage of R’s randomForest library. The data can be obtained from Kaggle; this tutorial specifically uses their “train” dataset.

The Bike Sharing dataset has 10,886 observations, each one about a specific hour from the first 19 days of each month from 2011 to 2012. The dataset consists of 12 columns that record information about bike rentals: date-time, season, holiday, working day, weather, temp, “feels like” temp, humidity, wind speed, casual rentals, registered rentals, and total rentals.

Feature engineering & preprocessing

There is an untapped wealth of prediction power hidden in the “DateTime” column. However, it needs to be converted from its current form. Conveniently, Azure ML has a module for running R scripts, which can take advantage of R’s built-in functionality for extracting features from the date-time data.

Since Azure ML automatically converts date-time data to date-time objects, it is easiest to convert the “DateTime” column to a string before sending it to the R script module. The date-time conversion function expects a string, so converting beforehand avoids formatting issues.

 

Azure machine learning model

 

We now select an R-Script Module to run our feature engineering script. This module allows us to import our dataset from Azure ML, add new features, and then export our improved data set. This module has many uses beyond the one in this tutorial, such as cleaning data and creating graphs.

Our goal is to convert the DateTime column of strings into date-time objects in R, so we can take advantage of their built-in functionality. R has two internal implementations of date-times: POSIXlt and POSIXct. We found Azure ML had problems dealing with POSIXlt, so we recommend using POSIXct for any date-time feature engineering.

The function as.POSIXct converts the DateTime column from a string in the specified format to a POSIXct object. Then we use the built-in functions for POSIXct objects to extract the weekday, month, and quarter for each observation. Finally, we use substr() to snip out the year and hour from the newly formatted date-time data.
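
As a minimal sketch of this conversion outside Azure ML, the snippet below parses two made-up date-time strings and extracts the same kinds of features. The sample values and the tz = "UTC" setting are assumptions for illustration. It uses format() rather than substr() to pull out the hour and year, which avoids depending on how a POSIXct value happens to be converted to a string.

```r
# Hypothetical sample values in the "%m/%d/%Y %I:%M:%S %p" layout
raw <- c("1/1/2011 12:00:00 AM", "7/15/2012 5:00:00 PM")

# Parse the strings into POSIXct objects (time zone fixed for reproducibility)
dt <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Extract features with the built-in helpers for date-time objects
features <- data.frame(
  hour    = format(dt, "%H"),   # two-digit hour, 00 to 23
  weekday = weekdays(dt),       # weekday name in the current locale
  month   = months(dt),         # month name in the current locale
  quarter = quarters(dt),       # "Q1" to "Q4"
  year    = format(dt, "%Y"),
  stringsAsFactors = FALSE
)
print(features)
```

Note that the weekday and month names, and the parsing of the AM/PM marker, depend on the locale of the R session.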

Remove problematic data

This dataset has only one observation where weather = 4. Since weather is a categorical variable, R will throw an error if that observation ends up in the test data split, because R expects the levels of each categorical variable in the test data to match the levels found in the training data split. Therefore, the observation must be removed.

#Bike sharing data set as input to the module
dataset <- maml.mapInputPort(1)
#Extracting hour, weekday, month, and year from the dataset
dataset$datetime <- as.POSIXct(dataset$datetime, format = "%m/%d/%Y %I:%M:%S %p")
dataset$hour <- substr(dataset$datetime, 12, 13)
dataset$weekday <- weekdays(dataset$datetime)
dataset$month <- months(dataset$datetime)
dataset$year <- substr(dataset$datetime, 1, 4)
#Preserving the column order
Count <- dataset[, names(dataset) %in% c("count")]
OtherColumns <- dataset[, !names(dataset) %in% c("count")]
dataset <- cbind(OtherColumns, Count)
#Remove the single observation with weather = 4 to prevent the scoring model from failing
dataset <- subset(dataset, weather != '4')

#Return the dataset after appending the new features
maml.mapOutputPort("dataset");
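
A rare level like this can also be found programmatically instead of by inspection. The following is a small sketch on a toy weather column, not the tutorial's data; min.count is a hypothetical threshold you would choose yourself.

```r
# Toy weather column: level 4 appears only once, as in the bike-sharing data
weather <- factor(c(1, 1, 2, 2, 3, 3, 4))

# Count observations per level and flag levels seen fewer than min.count times
min.count <- 2  # hypothetical threshold
level.counts <- table(weather)
rare.levels <- names(level.counts[level.counts < min.count])

# Drop the rare observations, then drop the now-empty level from the factor
weather.clean <- droplevels(weather[!(weather %in% rare.levels)])

print(rare.levels)
print(levels(weather.clean))
```

Calling droplevels() matters here: subsetting alone keeps the empty level 4 attached to the factor.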

Define categorical variables

Before training our model, we must tell Azure ML which variables are categorical. To do this, we use the Metadata Editor. We used the column selector to choose the hour, weekday, month, year, season, weather, holiday, and working day columns.

Then we select “Make categorical” under the “Categorical” dropdown.

Drop low-value columns

Before creating our random forest, we must identify columns that add little-to-no value for predictive modeling. These columns will be dropped.

Since we are predicting the total count, the registered bike rental and casual bike rental columns must be dropped. Together, these values add up to the total count, which would lead to a successful but uninformative model because the model would simply sum the two values to get the total count. One could train separate models to predict casual and registered bike rentals independently. Azure ML would make it very easy to include these models in our experiment after creating one for total count.

 

Dropping Low Value Columns - Azure machine learning

 

The third candidate for removal is the DateTime column. Each observation has a unique date-time, so this column just adds noise to our model, especially since we have already extracted all the useful information (day of the week, time of day, etc.).

Now that the dropped columns have been chosen, drag in the “Project Columns” module to drop DateTime, casual, and registered. Launch the column selector and select “All columns” from the dropdown next to “Begin With.” Change “Include” to “Exclude” using the dropdown and then select the columns we are dropping.
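
For readers who prefer to check this reasoning in R, here is a minimal sketch of the same step on a toy data frame. The rows are made up; only the column names follow the Kaggle data. It first confirms that casual and registered sum to count, then excludes the three columns in the same way the column selector does.

```r
# Toy rows with made-up values; column names follow the Kaggle data
bikes <- data.frame(
  datetime   = c("1/1/2011 12:00:00 AM", "1/1/2011 1:00:00 AM", "1/1/2011 2:00:00 AM"),
  casual     = c(3, 8, 5),
  registered = c(13, 32, 27),
  count      = c(16, 40, 32)
)

# casual + registered reproduces count exactly, so keeping them leaks the answer
leaks <- all(bikes$casual + bikes$registered == bikes$count)
print(leaks)

# Equivalent of choosing "Exclude" in the column selector
drop.cols <- c("datetime", "casual", "registered")
bikes.model <- bikes[, !(names(bikes) %in% drop.cols), drop = FALSE]
print(names(bikes.model))
```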

Specify a response class

We must now directly tell Azure ML which attribute we want our algorithm to train to predict by casting that attribute as a “label”.

Start by dragging in a metadata editor. Use the column selector to specify “Count” and change the “Fields” parameter to “Labels.” A dataset can only have 1 label at a time for this to work.

Our model is now ready for machine learning!

model for machine learning

Model building

Train your model

Here is where we take advantage of AzureML’s newest feature: the Create R Model module. Now we can use R’s randomForest library and take advantage of its large number of adjustable parameters directly inside AzureML Studio. Then, the model can be deployed in a web service. Previously, R models were nearly impossible to deploy to the web. For a detailed explanation of setting up data partitions and model training, check out our other tutorial here.

 

Train your r models

Similar to a native model in Azure ML, the Create R Model module connects to the Train Model module. The difference is that the user must provide R code for training and scoring separately. The training script goes under “Trainer R script”; it takes in one dataset as an input and outputs a model. The dataset corresponds to whichever dataset gets input to the connected Train Model module.

In this case, the dataset is our training split and the model output is a random forest. The scoring script goes under “Scorer R script” and has two inputs: a model and a dataset. These correspond to the model from the Train Model module and the dataset input to the Score Model module, which is the test split in this example.

The output is a data frame of the predicted values, which get appended to the original dataset. Make sure to appropriately label your outputs for both scripts as Azure ML expects exact variable names.

#Trainer R Script
#Input: dataset
#Output: model
library(randomForest)
model <- randomForest(Count ~ ., dataset)

#Scorer R Script
#Input: model, dataset
#Output: scores
library(randomForest)
scores <- data.frame(predict(model, subset(dataset, select = -c(Count))))
names(scores) <- c("Predicted Count")

Evaluate your model

Model building - evaluation

Unfortunately, AzureML’s Evaluate Model Module does not support models that use the Create R Model module, yet. We assume this feature will be added in the near future.

In the meantime, we can import the results from the scored model (Score Model module) into an Execute R Script module and compute an evaluation using R. We calculated the MSE then exported our result back to AzureML as a data frame.

#Results as input to module
dataset1 <- maml.mapInputPort(1)
countMSE <- mean((dataset1$Count - dataset1[["Predicted Count"]])^2)
evaluation <- data.frame(countMSE)
#Output evaluation
maml.mapOutputPort("evaluation");
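
The same calculation can be tried locally before running it in Azure ML. The sketch below uses a toy scored dataset with made-up numbers. A column name containing a space is read here with [[ ]], which returns a plain numeric vector that mean() can work with. The RMSE line is an optional addition, not part of the original script.

```r
# Toy scored dataset; the column names mirror the scorer script's output
scored <- data.frame(Count = c(10, 20, 30))
scored[["Predicted Count"]] <- c(12, 18, 33)

# Residuals between actual and predicted counts
errors <- scored$Count - scored[["Predicted Count"]]

# Mean squared error, and its square root in the units of the rental counts
countMSE <- mean(errors^2)
countRMSE <- sqrt(countMSE)

evaluation <- data.frame(countMSE, countRMSE)
print(evaluation)
```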

 

Syed Saad Peerzada
| July 17, 2022

Data Science Dojo has launched its Jupyter Hub for Machine Learning using Python offering on the Azure Marketplace. It comes with pre-installed machine learning libraries and pre-cloned GitHub repositories of famous machine learning books, which help learners take their first steps into the field of machine learning.

What is machine learning?

Machine learning is a sub-field of Artificial Intelligence. It allows machines to learn patterns from historical data and use those patterns to predict outcomes.

Machine learning using Python

Machine learning requires exploratory data analysis, data processing, and the training of models to predict outcomes. Python provides a vast number of libraries and frameworks that let the user collect, analyze, and transform data using the built-in functions these libraries provide, which makes coding easier and saves a significant amount of time.

machine learning python
Machine learning using Python

 PRO TIP: Join our 5-day instructor-led Python for Data Science training to enhance your machine learning skills.

Challenges for individuals

Individuals who are new to machine learning and want to excel in it usually lack the computing and learning resources needed to gain hands-on experience. Beginners also face compatibility issues while installing libraries.

What we provide

With just a single click, Jupyter Hub for Machine Learning using Python gives the learner a coding environment in the Azure cloud with pre-installed machine learning Python libraries, which removes the burden of installation. Moreover, this offer provides the learner with repositories of famous books on machine learning, which contain chapter-wise notebooks that serve as a learning resource for gaining hands-on experience. The heavy computations required for machine learning applications are not performed on the user’s local machine. Instead, they are performed in the Azure cloud, which increases responsiveness and processing speed.

Listed below are the pre-installed machine learning Python libraries and the sources of the repositories of machine learning books provided by this offer:

Python libraries

  • Pandas
  • NumPy
  • scikit-learn
  • mlpack
  • matplotlib
  • SciPy
  • Theano
  • Pycaret
  • Orange3
  • seaborn

Repositories

  •  GitHub repository of the book ‘Python Machine Learning Book 1st Edition’, by Sebastian Raschka.
  •  GitHub repository of the book ‘Python Machine Learning Book 2nd Edition’, by Sebastian Raschka.
  •  GitHub repository of the book ‘Hands-on Machine Learning with Scikit Learn, Keras, and TensorFlow’, by Aurélien Géron.
  •  GitHub repository of ‘Microsoft Azure Cloud Advocates 12-week Machine Learning curriculum’.

Conclusion

Jupyter Hub for Machine Learning using Python provides an in-browser coding environment with just a single click, hence providing ease of installation. Through this offer, a user can work on a variety of machine learning applications including stock market trading, email spam and malware filtering, product recommendations, online customer support, medical diagnosis, online fraud detection, and image recognition.

Jupyter Hub for Machine Learning using Python offered by Data Science Dojo is ideal to learn more about machine learning without the need to worry about configurations and computing resources. The heavy resource requirement for processing and training large data for these applications is no longer an issue as data-intensive computations are now performed on Microsoft Azure which increases processing speed.

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free Jupyter Notebook Environment dedicated specifically for Machine Learning using Python. The offering leverages the power of Microsoft Azure services to run effortlessly with outstanding responsiveness. Install the Jupyter Hub offer now from the Azure Marketplace by Data Science Dojo, your ideal companion in your journey to learn data science!


Data Science Dojo
| February 3, 2021

Learn the difference between supervised ML, unsupervised ML, and reinforcement learning. Test your knowledge of machine learning techniques with an interactive infographic.

The quiz below was made to help you test your knowledge of supervised ML, unsupervised ML, and reinforcement learning while understanding which machine learning techniques fall under these categories. Try it or even embed it into your webpage!

Supervised machine learning techniques

In supervised machine learning models, we give the model a dataset with the answers (labels) to learn how to predict the label(s) for other examples where the labels are unknown.

Reinforcement learning

Reinforcement learning, on the other hand, is not trained with the answer. Instead, an agent is either penalized or rewarded for interacting with the environment. It learns from previous attempts and tries to maximize the reward with each attempt.

Unsupervised machine learning techniques

Unsupervised machine learning algorithms find hidden structures between the attributes (features) when the given dataset does not include labels. This differs from supervised learning in that we don’t tell the model what it needs to learn.
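
The contrast between the supervised and unsupervised cases can be seen in a few lines of R. This is only an illustrative sketch on the built-in iris data: a linear regression stands in for a supervised technique and k-means for an unsupervised one, and the choice of three clusters is an assumption. Reinforcement learning needs an environment to interact with, so it is not shown.

```r
data(iris)
measurements <- iris[, 1:4]

# Supervised: Petal.Width plays the role of the label the model learns to predict
supervised.fit <- lm(Petal.Width ~ Petal.Length, data = iris)
predicted.width <- predict(supervised.fit,
                           newdata = data.frame(Petal.Length = 4.5))

# Unsupervised: no label is supplied; k-means groups rows by similarity alone
set.seed(42)
clusters <- kmeans(measurements, centers = 3, nstart = 10)

print(predicted.width)
print(table(clusters$cluster))
```

The regression is told what to predict through the formula, while k-means only receives the measurements and has to find structure by itself.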

Quiz yourself!

Want to upgrade your machine learning knowledge? Check out Data Science Dojo’s Instructor-led Data Science Bootcamp.

Phuc Duong
| February 26, 2016

You can use Microsoft Azure ML or Amazon ML to build your machine learning model, but what’s the difference between the two approaches? 

Have you noticed that we have two machine learning demos on our site that allow you to deploy predictive models? The Titanic Survival Predictor is designed to work with a Microsoft Azure model for machine learning.

The AWS Machine Learning Caller is our new demo that connects to an Amazon Machine Learning model.

The idea is that you can use Microsoft Azure ML or Amazon ML to build a machine learning model, and then use our demo to input values for the prediction.

Each ML program provides an endpoint that you can use to access the model and run predictions. Our demos interface with that endpoint and provide a graphic user interface for making predictions.

So, what’s the difference between the machine learning demos?

First of all, the backend is different. But we’ll keep this brief.

The graphic below shows what types of models can be run through the demo.

  • The cruise ship represents the Titanic classification model generated from our Azure ML tutorial.
  • The iris represents any classification model, such as a model used to predict species from a set of measurements.
  • The complicated graph represents a regression model. Regression models are used to predict a number given a set of input numbers.
Titanic Survivor Predictor - machine learning
AWS machine learning caller

You can see that the Titanic model can link to both demos, but the classification (iris) model only links to our Amazon demo. The numerical dataset does not work with either of our demos.

The demos are currently limited to classification models only (because linear regression models work differently and require a different backend).

MLaaS: User perspectives

From the user perspective, the Titanic Survival Predictor is built for a specific purpose. It interfaces with the exact Titanic classification model that we created for Azure and is included as part of our bootcamp. Users can change all the tuning parameters and make the model unique.

However, the input variables, or “schema,” must be labeled the same way as in the original model, or it won’t work.

So, if you rename one of the columns, the demo will have an error. However, since we published the Azure model online, it’s pretty easy to copy the model and change some parameters.

To get your predictive model to work with our Titanic Survival Predictor demo, you’ll need the following information:

  • Name (used to generate your own url)
  • Post URL (or endpoint)
  • API key

The AWS Machine Learning Caller is not built for a specific dataset like Titanic. It will work with any logistic regression model built in Amazon Machine Learning. When you input your access keys and model id, our demo automatically pulls the schema from Amazon.

It does not require a specific schema like our Titanic Survival Predictor.

To get your predictive model to work with our AWS Machine Learning Caller demo, you’ll need the following information:

  • Access key
  • Secret access key
  • AWS Account Region
  • AWS ML Model ID

Why do two machine learning demos do similar things?

These are training tools for our 5-day bootcamp. We use Microsoft Azure to teach classification models. The software has tools for data cleaning and manipulation. The way that the tools are laid out is visual and easy to understand. It provides a clear organization of the processes: input data, clean data, build a model, evaluate the model, and deploy the model.

Microsoft Azure has been a great way to teach the model-building process.

We’ve recently added Amazon Machine Learning to our curriculum. The program is simpler, where all the processes described above are automated. Amazon ML walks users through the process.

However, it does provide slightly different evaluation metrics than Microsoft Azure, so we use it to teach regression and classification models as well.

Help us get better!

We are always looking for ways to incorporate new tools into our curriculum. If there is a tool that you think we ought to have, please let us know in the comments.

Or, you can contact us here

 

MLaaS: Deploy & host predictive models to employ webservices
Phuc Duong
| March 15, 2016

This Azure tutorial will walk you through deploying a predictive model in Azure Machine Learning, using the Titanic dataset.

MLaaS overview:

The classification model, covered in this article, uses the Titanic dataset to predict whether a passenger will live or die, based on demographic information. We’ve already built the model for you and the front-end UI. This tutorial will show you how to customize the Titanic model we built and deploy your own version.

About the data

The Titanic dataset’s complexity scales up with feature engineering, making it one of the few datasets good for both beginners and experts. There are numerous public resources from which to obtain the Titanic dataset; however, the most complete (and clean) version of the data can be obtained from Kaggle, specifically their “train” data.

The “train” Titanic data ships with 891 rows, each one about a passenger on the RMS Titanic, the night of the disaster. The dataset also has 12 columns that record attributes of each passenger’s circumstances and demographics such as passenger id, passenger class, age, gender, name, number of siblings and spouses aboard, number of parents and children aboard, fare, ticket number, cabin number, port of embarkation, and whether or not they survived.

For additional reading, a repository of biographies about everyone aboard the RMS Titanic can be found here (complete with pictures).

Titanic route

Getting the experiment

About the titanic survival User Interface

From the dataset, we will build a predictive model and deploy the model in AzureML as a web service. Data Science Dojo has built a front-end UI to interact with such a web service.

Click on the link below to view a finished version of this deployed web service.

Titanic Survival Predictor

Use the app to see what your chance of survival might have been if you were on the Titanic. Play around with the different variables. What factors does the model deem important in calculating your predicted survival rate?

The following tutorial will walk you through how to deploy a Titanic prediction model as a web service.

 

titanic survival predictor

Get an Azure ML account

This MLaaS tutorial assumes that you already have an AzureML workspace. If you do not, please visit the following link for a tutorial on how to create one.

Creating Azure ML Workspace

Please note that an Azure ML 8-hour free trial does not have the option of deploying a web service.

If you already have an AzureML workspace, then simply visit:

https://studio.azureml.net/

Clone the experiment

For this MLaaS tutorial, we will provide you with the completed experiment by letting you clone ours. If you are curious about how we created the experiment, please view our companion tutorial, where we talk about the process of data mining.

 

clone

Model-Comparison

Our experiment is hosted in the Azure ML public gallery. Navigate to the experiment by clicking on the link below or by clicking “Clone to Azure ML” within the Titanic Survival Predictor web page itself. The Azure ML Gallery is a place where people can showcase their experiments within the Azure ML community.

Gallery Titanic Experiment

Click on the “open in studio” button.

The experiment and dataset will be copied to your studio workspace. You should now see a bunch of modules linked together in a workflow. However, since we have not run the experiment, the workflow is only a set of instructions that Azure ML will use to build your models. We will have to run the experiment to produce anything.

Click the “run” button at the bottom middle of the AzureML window.

This will execute the workflow that is present within the experiment. The experiment will take about 2 minutes and 30 seconds to finish running. Wait until every module has a green checkmark next to it. This indicates that each module has finished running.

MLaaS model evaluation and deployment

Select an algorithm

You may have noticed that the cloned experiment shipped with two predictive models–two different decision forests. However, because we can only deploy one predictive model, we should see which performs better. Right click on the output node of the evaluate model module and click “visualize.”

 

visualize model

Evaluate your model

For the purpose of this tutorial, we will define the “better” performing model as the one that scores a higher ROC AUC. We will gloss over evaluating the performance metrics of classification models since that would require a longer, more in-depth discussion.

In the evaluate model module, you will see a “ROC” graph with a blue and a red line graphed on it. The blue line represents the ROC performance of the model on the left, and the red line represents the performance of the model on the right. The higher the curve is on the graph, the better the performance. Since the red curve, the right model, is higher on the graph than the blue curve, we can say that the right model is the better performing model in this case. We will now deploy the corresponding decision forest model.
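
To make the comparison concrete, here is a small sketch of how the area under the ROC curve can be computed for two models and compared. It is not how Azure ML computes the metric internally: auc is a hypothetical helper that uses the rank-based (Mann-Whitney) formulation, and the labels and scores are made-up values.

```r
# Area under the ROC curve via the rank-sum (Mann-Whitney) formulation.
# labels: 1 = positive, 0 = negative; scores: predicted probability of positive
auc <- function(labels, scores) {
  r <- rank(scores)                 # ties receive average ranks
  n.pos <- sum(labels == 1)
  n.neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Toy data: the same six passengers scored by two hypothetical models
labels      <- c(1, 1, 0, 1, 0, 0)
left.model  <- c(0.9, 0.4, 0.5, 0.7, 0.3, 0.2)
right.model <- c(0.9, 0.8, 0.5, 0.7, 0.3, 0.2)

auc.left  <- auc(labels, left.model)
auc.right <- auc(labels, right.model)
print(c(left = auc.left, right = auc.right))

# Pick the model with the higher AUC
better <- if (auc.right > auc.left) "right" else "left"
print(better)
```

An AUC of 1 means every positive example is scored above every negative one, while 0.5 is what random scoring would achieve.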

 

evaluate model

Deploy the experiment

Before deployment, all modules must have a green check mark next to them.

To deploy the selected decision forest model, select the “train model module” on the right.

While that is selected, hover over the “setup web service” button on the bottom middle of the screen. A pull-up menu will appear. Select “predictive web service”.

Azure ML will now remove and consolidate unnecessary modules. Then it will automatically save the predictive model as a trained model and set up web service inputs and outputs.

 

train model (1)

deploy model

Drop the response class

Our web service is almost complete. However, we need to tune the logic behind the web service function. The score model module is the module that will execute the algorithm against a given dataset. The score model module can also be called the “prediction module” because that is what happens when you apply a trained algorithm against a dataset.

You will notice that the score model module also takes in a dataset on the right input node. When deploying a predictive model, the score model module will need a copy of the required schema. The dataset used to train the model is fed back into the score model module because that is the schema that our trained algorithm currently knows.

However, that schema also holds our response class “survived,” the attribute that we are trying to predict. We must now drop the survived column. To do this we will use the “project columns” module. Search for it in the search bar on the left side of the AzureML window, then drag it into the workspace.

Replicate the picture on the left by connecting the last metadata editor’s output node to the input of the new project columns module. Then connect the output of the new project columns module with the right input of the score model module.

Select the project columns module once the connections have been made. A “properties” window will appear on the right side of the AzureML window. Click on “launch column selector.”

To drop the “Survived” column we will “Begin with: All Columns,” then choose to “Exclude” by “column names,” “Survived.”

 

drop target

drop target - 1

Reroute web service input

We must now point our web service input in the correct direction. The web service input is currently pointing to the beginning of the workflow where data was cleaned, columns were renamed, and columns were dropped. However, the form on the Titanic Prediction App will do the cleansing for you.

Let’s reroute the web service input to point directly at our score model module. Drag the web service input module down toward the score model module and connect it to the right input node of the score model (the same node that the newly added project columns module is also connected to).

Deploy your model

Once all the rerouting has been done, run your experiment one last time. A “Deploy Web Service” button should now be clickable at the bottom middle of the Azure ML window. Click this and AzureML will automatically create and host your web service API with your own endpoints and post-URL.

 

deploy model -1

Exposing the deployed webservice

 

API Diagram

 

Test your webservice

You should now be on the web deployment screen for your web service. Congratulations! You are now in possession of a web service that is connected to a live predictive model. Let’s test this model to see if it behaves properly.

Click the “test” button in the middle of the web deployment screen. A window with a form should pop up. This form should look familiar because it is the same form that the Titanic Predictor App was showing you.

Send the form a few values to see what it returns. The predictions will come in JSON format. The last number in JSON is the prediction itself, which should be a decimal akin to a percentage. This percentage is the predicted likelihood of survival based upon the given parameters, or in this case the passenger’s circumstances while aboard the Titanic.

 

test model

 

Find your API key

The API key is located on the web deployment screen, above the test button that you clicked on earlier. The API key input box comes with a copy-to-clipboard button; click on that button to copy the key. Paste the key into the “Add Your Own Model” page.

 

find API

Get your post URL

To grab the post-URL, click on the “REQUEST/RESPONSE” button, to the left of the test button. This will take you to the API help page.

Under “Request” and to the right of “POST” is the URL. Copy and paste this URL into the “Add Your Own Model” form.

 

get POST url

 

get POST url - 1

Enjoy and share

You now have your very own web service! Remember to save the URL because it is your own web page that you may share with others.

If you have a free trial Azure ML account, please note that your web service may be discontinued when your free trial subscription ends.

 
