fbpx
Learn to build large language model applications: vector databases, langchain, fine tuning and prompt engineering. Learn more

Python libraries

Ali Haider - Author
Ali Haider Shalwani
| November 10

Generative AI is a branch of artificial intelligence that focuses on the creation of new content, such as text, images, music, and code. This is done by training machine learning models on large datasets of existing content, which the model then uses to generate new and original content. 

 

Want to build a custom large language model? Check out our in-person LLM bootcamp. 


Popular Python libraries for Generative AI

 

Python libraries for generative AI  | Data Science Dojo
Python libraries for generative AI

 

Python is a popular programming language for generative AI, as it has a wide range of libraries and frameworks available. Here are 10 of the top Python libraries for generative AI: 

 1. TensorFlow:

TensorFlow is a popular open-source machine learning library that can be used for a variety of tasks, including generative AI. TensorFlow provides a wide range of tools and resources for building and training generative models, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs).

TensorFlow can be used to train and deploy a variety of generative models, including: 

  • Generative adversarial networks (GANs) 
  • Variational autoencoders (VAEs) 
  • Transformer-based text generation models 
  • Diffusion models 

TensorFlow is a good choice for generative AI because it is flexible and powerful, and it has a large community of users and contributors. 

 

2. PyTorch:

PyTorch is another popular open-source machine learning library that is well-suited for generative AI. PyTorch is known for its flexibility and ease of use, making it a good choice for beginners and experienced users alike. 

PyTorch can be used to train and deploy a variety of generative models, including: 

  • Conditional GANs 
  • Autoregressive models 
  • Diffusion models 

PyTorch is a good choice for generative AI because it is easy to use and has a large community of users and contributors. 

 

Large language model bootcamp

 

3. Transformers:

Transformers is a Python library that provides a unified API for training and deploying transformer models. Transformers are a type of neural network architecture that is particularly well-suited for natural language processing tasks, such as text generation and translation.

Transformers can be used to train and deploy a variety of generative models, including: 

  • Transformer-based text generation models, such as GPT-3 and LaMDA 

Transformers is a good choice for generative AI because it is easy to use and provides a unified API for training and deploying transformer models. 

 

4. Diffusers:

Diffusers is a Python library for diffusion models, which are a type of generative model that can be used to generate images, audio, and other types of data. Diffusers provides a variety of pre-trained diffusion models and tools for training and fine-tuning your own models.

Diffusers can be used to train and deploy a variety of generative models, including: 

  • Diffusion models for image generation 
  • Diffusion models for audio generation 
  • Diffusion models for other types of data generation 

 

Diffusers is a good choice for generative AI because it is easy to use and provides a variety of pre-trained diffusion models. 

 

 

5. Jax:

Jax is a high-performance numerical computation library for Python with a focus on machine learning and deep learning research. It is developed by Google AI and has been used to achieve state-of-the-art results in a variety of machine learning tasks, including generative AI. Jax has a number of advantages for generative AI, including:

  • Performance: Jax is highly optimized for performance, making it ideal for training large and complex generative models. 
  • Flexibility: Jax is a general-purpose numerical computing library, which gives it a great deal of flexibility for implementing different types of generative models. 
  • Ecosystem: Jax has a growing ecosystem of tools and libraries for machine learning and deep learning, which can be useful for developing and deploying generative AI applications. 

Here are some examples of how Jax can be used for generative AI: 

  • Training generative adversarial networks (GANs) 
  • Training diffusion models 
  • Training transformer-based text generation models 
  • Training other types of generative models, such as variational autoencoders (VAEs) and reinforcement learning-based generative models 

 

Get started with Python, checkout our instructor-led live Python for Data Science training.  

 

6. LangChain: 

LangChain is a Python library for chaining multiple generative models together. This can be useful for creating more complex and sophisticated generative applications, such as text-to-image generation or image-to-text generation.

Overview of LangChain Modules
Overview of LangChain Modules

LangChain is a good choice for generative AI because it makes it easy to chain multiple generative models together to create more complex and sophisticated applications.  

 

7. LlamaIndex:

LlamaIndex is a Python library for ingesting and managing private data for machine learning models. LlamaIndex can be used to store and manage your training datasets and trained models in a secure and efficient way.

 

LlamaIndex is a good choice for generative AI because it makes it easy to store and manage your training datasets and trained models in a secure and efficient way. 

 

8. Weight and biases:

Weight and Biases (W&B) is a platform that helps machine learning teams track, monitor, and analyze their experiments. W&B provides a variety of tools and resources for tracking and monitoring your generative AI experiments, such as:

  • Experiment tracking: W&B makes it easy to track your experiments and see how your models are performing over time. 
  • Model monitoring: W&B monitors your models in production and alerts you to any problems. 
  • Experiment analysis: W&B provides a variety of tools for analyzing your experiments and identifying areas for improvement. 


Learn to build LLM applications

 

9. Acme:

Acme is a reinforcement learning library for TensorFlow. Acme can be used to train and deploy reinforcement learning-based generative models, such as GANs and policy gradients.

Acme provides a variety of tools and resources for training and deploying reinforcement learning-based generative models, such as: 

  • Reinforcement learning algorithms: Acme provides a variety of reinforcement learning algorithms, such as Q-learning, policy gradients, and actor-critic. 
  • Environments: Acme provides a variety of environments for training and deploying reinforcement learning-based generative models. 
  • Model deployment: Acme provides tools for deploying reinforcement learning-based generative models to production. 

 

 Python libraries help in building generative AI applications

These libraries can be used to build a wide variety of generative AI applications, such as:

  • Chatbots: Chatbots can be used to provide customer support, answer questions, and engage in conversations with users.
  • Content generation: Generative AI can be used to generate different types of content, such as blog posts, articles, and even books.
  • Code generation: Generative AI can be used to generate code, such as Python, Java, and C++.
  • Image generation: Generative AI can be used to generate images, such as realistic photos and creative artwork.

Generative AI is a rapidly evolving field, and new Python libraries are being developed all the time. The libraries listed above are just a few of the most popular and well-established options.

Syed Saad Peerzada
Syed Saad Peerzada
| August 4

What is web scraping?

Web scraping is the act of extracting the content and data from a website. The vast amount of data available on the internet is not open and available to download. As a result, ethical web scraping is the most effective technique to collect this data. There is also a debate about the legality of web scraping as the content may get stolen or the website can crash as a result of web scraping.

Ethical Web Scraping is the act of harvesting data legally by following ethical rules about web scraping. There are certain rules in ethical web scraping that when followed ensure trust between the website owner and web scraper.

Web scraping using Python

In Python, a learner can write a small piece of code to do large tasks. Since web scraping is used to save time, a small code written in Python can save a lot of time. Also, Python is simple and easy to understand and provides an extensive set of libraries for web scraping and further manipulation required on extracted data.

PRO TIP: Join our 5-day instructor-led Python for Data Science training to enhance your web scraping skills.

Challenges for individuals

Individuals who are new to web scraping and wish to flourish in their field usually lack the necessary computing and learning resources to obtain hands-on expertise. Also, they may face compatibility issues when installing libraries.

What we provide

With just a single click, Jupyter Hub for Ethical Web Scraping using Python comes with pre-installed Web Scraping python libraries, which gives the learner an effortless coding environment in the Azure cloud and reduces the burden of installation. Moreover, this offer provides the learner with a repository of the famous book on web scraping which contains chapter-wise notebooks which serve as a learning resource for a user in gaining hands-on experience with web scraping.

Through this offer, a learner can collect data from various sources legally by following the best practices for ethical web scraping mentioned in the latter section of this blog. Once the data is collected, it can be further analyzed to get valuable insights into almost everything while all the heavy computations are performed on Microsoft Azure hence saving the user from the trouble of running high computations on the local machine.

Python libraries:

Listed below are the pre-installed web scraping python libraries and the sources of repositories of web scraping book provided by this offer:

  •          Pandas
  •          NumPy
  •          Scikit-learn
  •          Beautifulsoup4
  •          lxml
  •          MechanicalSoup
  •          Requests
  •          Scrapy
  •          Selenium
  •          urllib

Repository:

  •          GitHub repository of book Web Scraping with Python 2nd Edition,
    by author Ryan Mitchell.

Best practices for ethical web scraping

Globally, there is a debate about whether web scraping is an ethical concept or not. The reason it is unethical is that when a website is queried repeatedly by the same user (in this case bot), too many requests land on the server simultaneously and all resources of the server may be consumed in generating responses for each request, preventing it from responding to other legitimate users.

In this way, the server denies responses to any further users, commonly known as a Denial of Service (DoS) attack.

Below are the best practices for ethical web scraping, and compliance with these will allow a web scraper to work ethically.

1.   Check out for ROBOTS.TXT

Robots.txt file, also known as the Robots Exclusion Standard, is used to inform the web scrapers if the website can be crawled or not, if yes then how to index the website. A legitimate web scraper is expected to respect the instructions in this file and not disobey the website owner’s allowed instructions.

2.   Check for website APIs

An ethical web scraper is expected to first look for the public API of the website in question instead of scraping it all together. Many website owners provide public API access which can be used by anyone looking to gain from the information available on the website. Provision of public API works in the best interests of both the ethical scrapper as well as the website owner, avoiding web scraping altogether.

3.   Avoid repeated requests

Vigorous scraping can occasionally cause functionality issues, resulting in a poor user experience for humans. As a result, it is always advised to scrape during off-peak hours. An ethical web scraper is expected to delay recurrent requests to avoid a DoS attack.

4.   Provide your identity

It is always a good idea to take responsibility for one’s actions. An ethical web scraper never hides his or her identity and provides it in a user-agent string. Not only does this make the intentions of the scraper clear but also provides a means of contact for any questions or concerns of the website owner.

5.   Avoid fake ownership

The content scraped through web scraper should always be respected and never passed on under the fake information of scraper as the author. This act can be regarded as highly unethical as well as illegal since the website owner may file a copyright claim. It also damages the reputation of genuine web scrapers and hurts the trust of the website owner.

6.  Ask for permission

Since the website information belongs to the owner, one should never presume it to be free and ask politely to use it for their means. An ethical web scraper always seeks permission from the website owner to avoid any future problems. The website owner should be given the choice of whether she agrees to scrape the data.

 7.  Give due credit

To encourage the website owner as a token of thanks, the web scraper should give due credit wherever possible. This can be done in many ways such as providing a link to the original website on any blog, article, or social media post by generating traffic for the original website.

Ethical web scraping

Conclusion

Ethical web scraping is a two-way street in which the website owner should be mindful of the global availability of the data, similarly, the scraper should not harm the website in any way and also first seek permission from the website owner. If a web scraper abides by the above-mentioned practices, I.e., he/she works ethically, the web owner may not only allow scraping his or her website but also provide helpful means to the scraper in the form of Meta data or a public API.

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free Jupyter Notebook Environment dedicated specifically for Ethical Web Scraping using Python. Install the Jupyter Hub offer now from the Azure Marketplace by Data Science Dojo, your ideal companion in your journey to learn data science!

Try now - CTA

Data Science Dojo
Ali Mohsin
| July 7

Data Science Dojo has launched Jupyter Hub for Data Visualization using Python offering to the Azure Marketplace with pre-installed data visualization libraries and pre-cloned GitHub repositories of famous books, courses, and workshops which enable the learner to run the example codes provided.

What is data visualization?

It is a technique that is utilized in all areas of science and research. We need a mechanism to visualize the data so we can analyze it because the business sector now collects so much information through data analysis. By providing it with a visual context through maps or graphs, it helps us understand what the information means. As a result, it is simpler to see trends, patterns, and outliers within huge data sets because the data is easier for the human mind to understand and pull insights from the data.

Data visualization using Python

It may assist by conveying data in the most effective manner, regardless of the industry or profession you have chosen. It is one of the crucial processes in the business intelligence process, takes the raw data, models it, and then presents the data so that conclusions may be drawn. Data scientists are developing machine learning algorithms in advanced analytics to better combine crucial data into representations that are simpler to comprehend and interpret.

Given its simplicity and ease of use, Python has grown to be one of the most popular languages in the field of data science over the years. Python has several excellent visualization packages with a wide range of functionality for you whether you want to make interactive or fully customized plots.

PRO TIP: Join our 5-day instructor-led Python for Data Science training to enhance your visualization skills.

Data visualization using Python
Using Python to visualize Data

Challenges for individuals

Individuals who want to visualize their data and want to start visualizing data using some programming language usually lack the resources to gain hands-on experience with it. A beginner in visualization with programming language also faces compatibility issues while installing libraries.

What we provide

Our Offer, Jupyter Hub for Visualization using Python solves all the challenges by providing you with an effortless coding environment in the cloud with pre-installed Data Visualization python libraries which reduces the burden of installation and maintenance of tasks hence solving the compatibility issues for an individual.

Additionally, our offer gives the user access to repositories of well-known books, courses, and workshops on data visualization that include useful notebooks which is a helpful resource for the users to get practical experience with data visualization using Python. The heavy computations required for applications to visualize data are not performed on the user’s local machine. Instead, they are performed in the Azure cloud, which increases responsiveness and processing speed.   

Listed below are the pre-installed data visualization using python libraries and the sources of repositories of a book to visualize data, a course, and a workshop provided by this offer:

Python libraries:

  • NumPy
  • Matplotlib
  • Pandas
  • Seaborn
  • Plotly
  • Bokeh
  • Plotnine
  • Pygal
  • Ggplot
  • Missingno
  • Leather
  • Holoviews
  • Chartify
  • Cufflinks

Repositories:

  • GitHub repository of the book Interactive Data Visualization with Python, by author Sharath Chandra Guntuku, AbhaBelorkar, Shubhangi Hora, Anshu Kumar.
  • GitHub repository of Data Visualization Recipes in Python, by Theodore Petrou.
  • GitHub repository of Python data visualization workshop, by Stefanie Molin (Author of “Hands-On Data Analysis with Pandas”).
  • GitHub repository Data Visualization using Matplotlib, by Udacity.

Conclusion:

Because the human brain is not designed to process such a large amount of unstructured, raw data and turn it into something usable and understandable form, we require techniques to visualize data. We need graphs and charts to communicate data findings so that we can identify patterns and trends to gain insight and make better decisions faster. Jupyter Hub for Data Visualization using Python provides an in-browser coding environment with just a single click, hence providing ease of installation. Through our offer, a user can explore various application domains of data visualizations without worrying about the configuration and computations.

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free Jupyter Notebook Environment dedicated specifically for Data Visualization using Python. The offering leverages the power of Microsoft Azure services to run effortlessly with outstanding responsiveness. Make your complex data understandable and insightful with us and Install the Jupyter Hub offer now from the Azure Marketplace by Data Science Dojo, your ideal companion in your journey to learn data science!

Try Now!

Data Science Dojo
Syed Saad Peerzada
| July 17

Data Science Dojo has launched Jupyter Hub for Machine Learning using Python offering to the Azure Marketplace with pre-installed machine learning libraries and pre-cloned GitHub repositories of famous machine learning books which help the learner to take the first steps into the field of machine learning.

What is machine learning?

Machine learning is a sub-field of Artificial Intelligence. It is an innovative technology that allows machines to learn from historical data and provide the best results to predict outcomes.

Machine learning using Python

Machine learning requires exploratory data analysis, data processing, and the training of data to predict outcomes. Python provides a vast number of libraries and frameworks that let the user collect, analyze and transform data by just using built-in functions provided by the library which makes coding easy and also saves a significant amount of time.

machine learning python
Machine learning using Python

 PRO TIP: Join our 5-day instructor-led Python for Data Science training to enhance your machine learning skills.

Challenges for individuals

Individuals who are new to machine learning and want to excel in their path in machine learning usually lack computing as well as learning resources to gain hands-on experience with machine learning. A beginner in machine learning also faces compatibility issues while installing libraries.

What we provide

With just a single click, Jupyter Hub for Machine Learning using Python comes with pre-installed machine learning python libraries, which gives the learner an effortless coding environment in the Azure cloud and reduces the burden of installation. Moreover, this offer provides the learner with repositories of famous books on machine learning which contain chapter-wise notebooks which serve as a learning resource for a user in gaining hands-on experience with machine learning. The heavy computations required for Machine Learning applications are not performed on the user’s local machine. Instead, they are performed in the Azure cloud, which increases responsiveness and processing speed.

Listed below are the pre-installed machine learning python libraries and the sources of repositories of machine learning books provided by this offer:

Python libraries

  • Pandas
  • NumPy
  • scikit-learn
  • mlpack
  • matplotlib
  • SciPy
  • Theano
  • Pycaret
  • Orange3
  • seaborn

Repositories

  •  Github repository of book ‘Python Machine Learning Book 1st Edition’, by author Sebastian Raschka.
  •  Github repository of book ‘Python Machine Learning Book 2nd Edition’, by author Sebastian Raschka.
  •  Github repository of the book ‘Hands-on Machine Learning with Scikit Learn, Keras, and TensorFlow’, by author Geron-Aurelien.
  •  Github repository of ‘Microsoft Azure Cloud Advocates 12-week Machine Learning curriculum’.

Conclusion

Jupyter Hub for Machine Learning using Python provides an in-browser coding environment with just a single click, hence providing ease of installation. Through this offer, a user can work on a variety of machine learning applications including stock market trading, email spam and malware filtering, product recommendations, online customer support, medical diagnosis, online fraud detection, and image recognition.

Jupyter Hub for Machine Learning using Python offered by Data Science Dojo is ideal to learn more about machine learning without the need to worry about configurations and computing resources. The heavy resource requirement for processing and training large data for these applications is no longer an issue as data-intensive computations are now performed on Microsoft Azure which increases processing speed.

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free Jupyter Notebook Environment dedicated specifically for Machine Learning using Python. The offering leverages the power of Microsoft Azure services to run effortlessly with outstanding responsiveness. Install the Jupyter Hub offer now from the Azure Marketplace by Data Science Dojo, your ideal companion in your journey to learn data science!

Try Now!

Data Science Dojo
Ali Mohsin
| July 18

Data Science Dojo has launched  Jupyter Hub for Computer Vision using Python offering to the Azure Marketplace with pre-installed libraries and pre-cloned GitHub repositories of famous Computer Vision books and courses which enables the learner to run the example codes provided.

What is computer vision?

It is a field of artificial intelligence that enables machines to derive meaningful information from visual inputs.

Computer vision using Python

In the world of computer vision, Python is a mainstay. Even if you are a beginner or the language application you are reviewing was created by a beginner, it is straightforward to understand code. Because the majority of its code is extremely difficult, developers can devote more time to the areas that need it.

 

computer vision python
Computer vision using Python

Challenges for individuals

Individuals who want to understand digital images and want to start with it usually lack the resources to gain hands-on experience with Computer Vision. A beginner in Computer Vision also faces compatibility issues while installing libraries along with the following:

  1. Image noise and variability: Images can be noisy or low quality, which can make it difficult for algorithms to accurately interpret them.
  2. Scale and resolution: Objects in an image can be at different scales and resolutions, which can make it difficult for algorithms to recognize them.
  3. Occlusion and clutter: Objects in an image can be occluded or cluttered, which can make it difficult for algorithms to distinguish them.
  4. Illumination and lighting: Changes in lighting conditions can significantly affect the appearance of objects in an image, making it difficult for algorithms to recognize them.
  5. Viewpoint and pose: The orientation of objects in an image can vary, which can make it difficult for algorithms to recognize them.
  6. Occlusion and clutter: Objects in an image can be occluded or cluttered, which can make it difficult for algorithms to distinguish them.
  7. Background distractions: Background distractions can make it difficult for algorithms to focus on the relevant objects in an image.
  8. Real-time performance: Many applications require real-time performance, which can be a challenge for algorithms to achieve.

 

What we provide

Jupyter Hub for Computer Vision using the language solves all the challenges by providing you an effortless coding environment in the cloud with pre-installed computer vision python libraries which reduces the burden of installation and maintenance of tasks hence solving the compatibility issues for an individual.

Moreover, this offer provides the learner with repositories of famous books and courses on the subject which contain helpful notebooks which serve as a learning resource for a learner in gaining hands-on experience with it.

The heavy computations required for its applications are not performed on the learner’s local machine. Instead, they are performed in the Azure cloud, which increases responsiveness and processing speed.

Listed below are the pre-installed python libraries and the sources of repositories of Computer Vision books provided by this offer:

Python libraries

  • Numpy
  • Matplotlib
  • Pandas
  • Seaborn
  • OpenCV
  • Scikit Image
  • Simple CV
  • PyTorch
  • Torchvision
  • Pillow
  • Tesseract
  • Pytorchcv
  • Fastai
  • Keras
  • TensorFlow
  • Imutils
  • Albumentations

Repositories

  • GitHub repository of book Modern Computer Vision with PyTorch, by author V Kishore Ayyadevara and Yeshwanth Reddy.
  • GitHub repository of Computer Vision Nanodegree Program, by Udacity.
  • GitHub repository of book OpenCV 3 Computer Vision with Python Cookbook, by author Aleksandr Rybnikov.
  • GitHub repository of book Hands-On Computer Vision with TensorFlow 2, by authors Benjamin Planche and Eliot Andres.

Conclusion

Jupyter Hub for Computer Vision using Python provides an in-browser coding environment with just a single click, hence providing ease of installation. Through this offer, a learner can dive into the world of this industry to work with its various applications including automotive safety, self-driving cars, medical imaging, fraud detection, surveillance, intelligent video analytics, image segmentation, and code and character reader (or OCR).

Jupyter Hub for Computer Vision using Python offered by Data Science Dojo is ideal to learn more about the subject without the need to worry about configurations and computing resources. The heavy resource requirement to deal with large Images, and process and analyzes those images with its techniques is no more an issue as data-intensive computations are now performed on Microsoft Azure which increases processing speed.

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free Jupyter Notebook Environment dedicated specifically for it using Python. Install the Jupyter Hub offer now from the Azure Marketplace, your ideal companion in your journey to learn data science!

Try Now!

Data Science Dojo

Finding the top Python packages for data science and libraries that aren’t only popular, but get the job done isn’t easy. Here’s a list to help you out.

Out of all the Python scientific libraries and packages available, which ones are not only popular but the most useful in getting the job done?

Python packages for data science and libraries

To help you filter down a list of libraries and packages worth adding to your data science toolbox, we have compiled our top picks for aspiring and practicing data scientists. But you’ll also want to know how to best use these tools for tricky, real-world data problems. So instead of leaving you with yet another top-choice list among a quintillion list, we explain how to make the most of these libraries using real-world examples.

You can learn more about how these packages fit into data science with Data Science Dojo’s introduction to Python course.

Data manipulation

Pandas

There’s a reason why pandas consistently tops published ranks on data science-related libraries in Python. The library can help you with a variety of tasks, but it is particularly useful for data manipulation or data wrangling. It can save you a lot of leg work in not only your typical rudimentary data manipulation tasks but also in handling some pretty tricky problems you might encounter when slicing and filtering.

Multi-indexed data can be one of these tricky tasks. The library pandas takes care of advanced indexing, including multi-indexing, where you might need to work with higher-dimensional data or multiple index levels. For example, the number of user interactions might be indexed by 1) product category, 2) the time of day the user interacted with the product, and 3) the location of the user.

Instead of your typical table of rows and columns to represent the data, you might find it better to organize the number of user interactions into all cases that fall under the x product category, with y time of day, and z location. This way you can easily see user interactions across each condition of product category, time of day, and user location. This saves you from having to apply a filter or group for all combinations of conditions in your traditional row-and-table structure.

Here is one way to multi-index data in pandas. With less than a few lines of code, pandas makes this easy to implement in Python:

import pandas as pd

data_multi_indx = table_data.set_index(['Product', 'Day of Week'])
print(data_multi_indx)
'''
Output:
                      Location  Num User Interactions
Product   Day of Week
Product 1 Morning            A                      3
          Morning            B                     90
          Morning            C                      7
          Afternoon          A                     17
          Afternoon          B                      1
          Afternoon          C                     82
Product 2 Morning            A                     27
          Morning            B                     70
          Morning            C                      3
          Afternoon          A                      1
          Afternoon          B                      1
          Afternoon          C                     98
Product 3 Morning            A                     94
          Morning            B                      5
          Morning            C                      1
          Afternoon          A                      0
          Afternoon          B                      7
          Afternoon          C                     93
'''

For the more rudimentary data manipulation tasks, pandas doesn’t require much effort on your part. You can simply use the functions available for imputing missing values, one-hot encoding, dropping columns and rows, and so on.

Here are a few example classes and functions  pandas that make rudimentary data manipulation easy in a few lines of code, at most.

For more lessons with Pandas, visit Data Independent.

Feature Description
fillna(value) Fill in missing values on a column or the whole data frame with a value such as the mean, median, or mode.
isna(data)/isnull(data) Check for missing values.
get_dummies(data_frame['Column']) Apply one-hot encoding on a column.
to_numeric(data_frame['Column']) Convert a column of values from strings to numeric values.
to_string(data_frame['Column']) Convert a column of values from numeric values to strings.
to_datetime(data_frame['Column']) Convert a column of datetimes in string format to standard datetime format.
drop(columns=['Column0','Column1']) Drop specific columns or useless columns in your data frame.
drop(data.frame.index[[rownum0,rownum1]]) Drop specific rows or useless rows in your data frame.

NumPy

Another library that keeps topping the ranks is numpy. This library can handle many tasks, but it is particularly useful when working with multi-dimensional arrays and performing calculations on these arrays. This can be tricky to do in more conventional ways, where you need to find the index of a value or certain values inside another index, with multiple indices.

 

Read about Top Python projects to choose from in 2023

 

This is where numpy shows its strength. Its array() function means standard arrays can be simply added and nicely bundled into a multi-dimensional array. Calculations on these arrays can also be easily implemented using numpy’s vast array (pun intended) of mathematical functions.

Let’s picture an example where numpy’s multi-dimensional arrays are useful. A company tracks or records if a user was/was not shown a mobile product in the morning, afternoon, and night, delivered through a mobile notification. Based on the level of user interaction with the shown product, the company also records a user engagement score.

Data points on each user’s shown product and engagement score are stored inside an array; each array stores these values for each user. The company would like to quickly and simply bundle all user arrays.

In addition to this, using engagement score and purchase history, the company would like to calculate and identify the minimum distance (or difference) across all users’ data points so that users who follow a similar pattern can be categorized and targeted accordingly.

numpy’s array() makes it easy to bundle user arrays into a multi-dimensional array and argmin() and linalg.norm() find the min Euclidean distance between users, as an example of the kinds of calculations that can be done on a multi-dimensional array:

import numpy as np

# Records tracking whether user was/was not shown product during
# morning, afternoon, and night, and user engagement score
user_0 = [0,0,1,0.7]
user_1 = [0,1,0,0.4]
user_2 = [1,0,0,0.0]
user_3 = [0,0,1,0.9]
user_4 = [0,1,0,0.3]
user_5 = [1,0,0,0.0]
# Create a multi-dimensional array to bundle all users
# Can use arrays with mixed data types by specifying 
# the object data type in numpy multi-dimensional arrays
users_multi_dim = np.array([user_0,user_1,user_2,user_3,user_4,user_5],dtype=object)
print(users_multi_dim)
'''
Output:
[[0 0 1 0.7]
 [0 1 0 0.4]
 [1 0 0 0.0]
 [0 0 1 0.9]
 [0 1 0 0.3]
 [1 0 0 0.0]]
'''
# To view which user was/was not shown the product
# either morning, afternoon or night, pandas easily 
# allows you to index and label the data
row_names = [_ for _ in ['User 0','User 1','User 2','User 3','User 4','User 5']]
col_names = [_ for _ in ['Product Shown Morning','Product Shown Afternoon',
                         'Product Shown Night','User Engagement Score']]
users_df_indexed = pd.DataFrame(users_multi_dim,index=row_names,columns=col_names)
print(users_df_indexed)
'''
Output:
       Product Shown Morning Product Shown Afternoon Product Shown Night User Engagement Score
User 0                     0                       0                   1                   0.7
User 1                     0                       1                   0                   0.4
User 2                     1                       0                   0                     0
User 3                     0                       0                   1                   0.9
User 4                     0                       1                   0                   0.3
User 5                     1                       0                   0                     0
'''
# Find which existing user is closest to the engagement 
# and purchase behavior of a new user by calculating the 
# min Euclidean distance on a numpy multi-dimensional array
user_0 = [0.7,51.90,2]
user_1 = [0.4,25.95,1]
user_2 = [0.0,0.00,0]
user_3 = [0.9,77.85,3]
user_4 = [0.3,25.95,1]
user_5 = [0.0,0.00,0]
users_multi_dim = np.array([user_0,user_1,user_2,user_3,user_4,user_5])
new_user = np.array([0.8,77.85,3])
closest_to_new = np.argmin(np.linalg.norm(users_multi_dim-new_user,axis=1))
print('User', closest_to_new, 'is closest to the new user')
'''
Output:
User 3 is closest to the new user
'''

Data modeling

Statsmodels

The main strength of statsmodels is its focus on statistics, going beyond the ‘machine learning out-of-the-box’ approach. This makes it a popular choice for data scientists. Conducting statistical tests to find significantly different variables, checking for normality in your data, checking the standard errors, and so on, cannot be underestimated when trying to build the most effective model you can build. Your model is only as good as your inputs, and statsmodels is designed to help you better understand and customize your inputs.

The library also covers an exhaustive list of predictive models to choose from, depending on your predictors and outcome variable(s). It covers your classic Linear Regression models (including ordinary least squares, weighted least squares, recursive least squares, and more), Generalized Linear models, Linear Mixed Effects models, Binomial and Poisson Bayesian models, Logit and Probit models, Time Series models (including autoregressive integrated moving average, dynamic factor, unobserved component, and more), Hidden Markov models, Principal Components and other techniques for Multivariate models, Kernel Density estimators, and lots more.

Here are the classes and functions in statsmodels that cover the main modeling techniques useful for many prediction tasks.

Classes and functions in statsmodel - Python packages

Scikit-learn

Any library that makes machine learning more accessible and easier to implement is bound to make the top choice list among aspiring and practicing data scientists. The Library scikit-learn not only allows models to be easily implemented out-of-the-box but also offers some auto fine-tuning.

Finding the best possible combination of model parameters is a key example of fine-tuning. The library offers a few good ways to search for the optimal set of parameters, given the algorithm and problem to solve. The grid search and random search algorithms in scikit-learn evaluate different combinations of parameters until they find the best combo that results in the best outcome, or a better-performing model.

The grid search goes through every possible combination, whereas the random search randomly samples the parameters over a fixed number of times/iterations. Cross-validating your model on many subsets of data is also easy to implement using scikit-learn. With this kind of automation, the library offers data scientists a massive time saver when building models.

The library also covers all the essential machine learning models from classification (including Support Vector Machine, Random Forest, etc), to regression (including Ridge Regression, Lasso Regression, etc), and clustering (including k-Means, Mean Shift, etc).

Here are the classes and functions in scikit-learn that cover the main modeling techniques useful for many prediction tasks.

Feature Description
SVC()GaussianNB()LogisticRegression()DecisionTreeClassifier(),

RandomForestClassifier()SGDClassifier()MLPClassifier()

Classification models: Support Vector Machine, Gaussian Naïve Bayes, Logistic Regression, Decision Tree, Random Forest, Stochastic Gradient Descent, Multi-Layer Perceptron
linear_model.Ridge()linear_model.Lasso()SVR()DecisionTreeRegressor(),

RandomForestRegressor()SGDRegressorMLPRegressor()

Regression models: Ridge Regression, Lasso Regression, Support Vector Machine, Decision Tree, Random Forest, Stochastic Gradient Descent, Multi-Layer Perceptron
KMeans()AffinityPropagation()MeanShift()AgglomerativeClustering Clustering models: k-Means, Affinity Propagation, Mean Shift, Agglomerative Hierarchical Clustering

Data visualization

Plotly

The libraries matplotlib and seaborn will easily take care of your basic static plot functions, which are important for your own internal exploration or understanding of the data. But when presenting visual insights to business folks or users, interactivity is where we are headed these days.

Using JavaScript functionality, plotly renders interactive graphs in the form of zooming in and panning out of the graph panel, hovering over objects for more information, and dragging objects into position to further explore relationships in the data. Graphs can be customized to your heart’s content.

Here are just a few of the many tricks that Plotly offers:

Feature Description
hovermodehoverinfo Controls the mode and text when a user hovers over an object.
on_selection()on_click() Allows a user to select or click on an object and have that selected object change color, for example.
update Modifies a graph’s layout and data such as titles and annotations.
animate Creates an animated graph.

Bokeh

Much like plotlybokeh also offers interactive graphs. But one feature that stands out in bokeh is linked interactions. This is useful when keeping separate graphs in unison, where the user interacts with one graph and needs to compare with the other while they are in sync. For example, a user zooms into a graph, effectively changing the range of the graph, and then would like to compare it with the second graph. The second graph would need to automatically update its range so that both graphs can be easily compared like-for-like.

Here are some key tricks that bokeh offers:

Feature Description
figure() Creates a new plot and allows linking to the range of another plot.
HoverTool()hover_glyph Allows users to hover over an object for more information.
selection_glyph Selects a particular glyph object for styling.
Slider() Creates a slider to dynamically update the plot based on the slide range.

Related Topics

Statistics
Resources
Programming
Machine Learning
LLM
Generative AI
Data Visualization
Data Security
Data Science
Data Engineering
Data Analytics
Computer Vision
Career
Artificial Intelligence