
pandas

Ayesha Saleem
| May 1

Python is a powerful and versatile programming language that has become increasingly popular in the field of data science. One of the main reasons for its popularity is the vast array of libraries and packages available for data manipulation, analysis, and visualization.

10 Python packages for data science and machine learning

In this article, we will highlight some of the top Python packages for data science that aspiring and practicing data scientists should consider adding to their toolbox. 

1. NumPy 

NumPy is a fundamental package for scientific computing in Python. It supports large, multi-dimensional arrays and matrices of numerical data, as well as a large library of mathematical functions to operate on these arrays. The package is particularly useful for performing mathematical operations on large datasets and is widely used in machine learning, data analysis, and scientific computing. 
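
As a small illustration of the array operations described above, the following sketch builds a 2-D array from made-up numbers and applies element-wise arithmetic, a column-wise reduction, and a matrix product. The values are purely illustrative.

```python
import numpy as np

# A 2-D array (matrix) of made-up measurements: 3 rows, 4 columns.
data = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
])

# Vectorized arithmetic applies to every element without an explicit loop.
scaled = data * 2.0

# Reductions along an axis: axis=0 aggregates down each column.
column_means = data.mean(axis=0)

# Matrix product of the (3, 4) array with its (4, 3) transpose.
gram = data @ data.T

print(scaled.shape)   # (3, 4)
print(column_means)   # [5. 6. 7. 8.]
print(gram.shape)     # (3, 3)
```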

2. Pandas 

Pandas is a powerful data manipulation library for Python that provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data easy and intuitive. The package is particularly well-suited for working with tabular data, such as spreadsheets or SQL tables, and provides powerful data cleaning, transformation, and wrangling capabilities. 
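
To make the cleaning and transformation capabilities concrete, here is a minimal sketch on a made-up sales table. The column names and values are invented for illustration only.

```python
import pandas as pd

# A small labeled table with one missing value (made-up data).
df = pd.DataFrame({
    "city": ["Lahore", "Seattle", "Lahore", "Seattle"],
    "year": [2022, 2022, 2023, 2023],
    "sales": [100.0, None, 150.0, 200.0],
})

# Cleaning: fill the missing value with the median of the observed sales.
df["sales"] = df["sales"].fillna(df["sales"].median())

# Transformation: aggregate total sales per city.
totals = df.groupby("city")["sales"].sum()

print(totals)
```

Filling with the median is only one of several reasonable choices; dropping the row with dropna() would be another.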

3. Matplotlib 

Matplotlib is a plotting library for Python that provides an extensive API for creating static, animated, and interactive visualizations. The library is highly customizable, and users can create a wide range of plots, including line plots, scatter plots, bar plots, histograms, and heat maps. Matplotlib is a great tool for data visualization and is widely used in data analysis, scientific computing, and machine learning. 
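
The sketch below draws two of the plot types mentioned above, a line plot and a histogram, side by side. It assumes a non-interactive backend ("Agg") and writes the image to an in-memory buffer so that it runs without a display; in a notebook you would normally skip those two steps and call plt.show().

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# One figure with two side-by-side axes.
fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Line plot of sin(x).
ax_line.plot(x, np.sin(x), label="sin(x)")
ax_line.set_title("Line plot")
ax_line.legend()

# Histogram of 500 random draws from a standard normal distribution.
rng = np.random.default_rng(0)
ax_hist.hist(rng.normal(size=500), bins=20)
ax_hist.set_title("Histogram")

fig.tight_layout()

# Render the figure as PNG bytes in memory.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```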

4. Seaborn 

Seaborn is a library for creating attractive and informative statistical graphics in Python. The library is built on top of Matplotlib and provides a high-level interface for creating complex visualizations, such as heat maps, violin plots, and scatter plots. Seaborn is particularly well-suited for visualizing complex datasets and is often used in data exploration and analysis. 
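
As a rough sketch of the high-level interface, the example below draws a violin plot from a made-up DataFrame with two groups. The data and column names are invented, and the "Agg" backend is assumed so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data: two groups drawn from normal distributions with different means.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": np.repeat(["a", "b"], 100),
    "value": np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100)]),
})

# One call produces a full statistical graphic; the return value is a Matplotlib Axes.
ax = sns.violinplot(data=df, x="group", y="value")
ax.set_title("Distribution of value by group")
```

Because the result is a regular Matplotlib Axes, any further customization can use the Matplotlib API directly.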

5. Scikit-learn 

Scikit-learn is a powerful library for machine learning in Python. It provides a wide range of tools for supervised and unsupervised learning, including linear regression, k-means clustering, and support vector machines. The library is built on top of NumPy and SciPy and is designed to be easy to use and highly extensible. Scikit-learn is a go-to tool for data scientists and machine learning practitioners. 
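
The following sketch shows one supervised and one unsupervised estimator from the list above, fitted on synthetic data generated by scikit-learn itself. The dataset sizes and parameters are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Supervised learning: fit a linear regression and score it on held-out data.
X, y = make_regression(n_samples=200, n_features=3, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_train, y_train)
r2 = reg.score(X_test, y_test)  # R-squared on the test split

# Unsupervised learning: group 2-D points into three clusters with k-means.
points, _ = make_blobs(n_samples=150, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

print(f"test R^2: {r2:.3f}")
print(kmeans.cluster_centers_.shape)  # (3, 2)
```

Both estimators follow the same fit/predict/score convention, which is a large part of what makes the library easy to pick up.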

6. TensorFlow 

TensorFlow is an open-source software library for dataflow and differentiable programming across various tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks. TensorFlow was developed by the Google Brain team and is used in many of Google’s products and services. 
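
Differentiable programming can be illustrated with a very small sketch: TensorFlow records the operations applied to a variable and computes the derivative automatically. The function used here is an arbitrary example.

```python
import tensorflow as tf

# A trainable scalar variable.
x = tf.Variable(3.0)

# Operations executed inside the tape context are recorded for differentiation.
with tf.GradientTape() as tape:
    y = x ** 2 + 2.0 * x

# dy/dx = 2x + 2, which equals 8 at x = 3.
dy_dx = tape.gradient(y, x)
print(float(dy_dx))  # 8.0
```

The same mechanism is what drives gradient-based training of neural networks, where the variables are model weights and y is a loss.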

7. SQLAlchemy

SQLAlchemy is a Python package that serves as both a SQL toolkit and an Object-Relational Mapping (ORM) library. It is designed to simplify the process of working with databases by providing a consistent and high-level interface. It offers a set of utilities and abstractions that make it easier to interact with relational databases using SQL queries. It provides a flexible and expressive syntax for constructing SQL statements, allowing you to perform various database operations such as querying, inserting, updating, and deleting data.
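
The sketch below runs the four operations mentioned above (insert, update, delete, query) against an in-memory SQLite database using SQLAlchemy's SQL toolkit layer. It assumes SQLAlchemy 1.4 or newer; the table and its columns are made up for illustration.

```python
from sqlalchemy import (
    Column, Integer, MetaData, String, Table, create_engine, insert, select,
)

# In-memory SQLite database, so nothing is written to disk.
engine = create_engine("sqlite:///:memory:")
metadata = MetaData()

# A hypothetical table definition.
users = Table(
    "users",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String, nullable=False),
)
metadata.create_all(engine)

# engine.begin() opens a transaction and commits it when the block exits.
with engine.begin() as conn:
    # Insert two rows.
    conn.execute(insert(users), [{"name": "Ada"}, {"name": "Grace"}])
    # Update one row.
    conn.execute(
        users.update().where(users.c.name == "Ada").values(name="Ada Lovelace")
    )
    # Delete the other row.
    conn.execute(users.delete().where(users.c.name == "Grace"))
    # Query what is left.
    names = [row.name for row in conn.execute(select(users.c.name))]

print(names)  # ['Ada Lovelace']
```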

8. OpenCV

OpenCV (imported in Python as cv2) is a library of programming functions mainly aimed at real-time computer vision. Originally developed by Intel, it was later supported by Willow Garage and Itseez. OpenCV is available for C++, Python, and Java. 

9. urllib 

urllib is a package in the Python standard library that provides a set of simple, high-level functions for working with URLs and web protocols. Its modules, such as urllib.request and urllib.parse, include functions for opening URLs, sending and receiving data, and parsing URLs. 
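
The URL-parsing side can be sketched without any network access. The URL below is a placeholder used only for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# A placeholder URL with a query string.
url = "https://example.com/search?q=data+science&page=2"

# Split the URL into its components (scheme, host, path, query, ...).
parts = urlparse(url)
print(parts.netloc)  # example.com
print(parts.path)    # /search

# Decode the query string into a dictionary of lists.
query = parse_qs(parts.query)
print(query)  # {'q': ['data science'], 'page': ['2']}

# Build a new, properly escaped query string from a dictionary.
new_query = urlencode({"q": "machine learning", "page": 1})
print(new_query)  # q=machine+learning&page=1
```

Fetching a page would go through urllib.request.urlopen, which is left out here so the example runs offline.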

10. BeautifulSoup 

BeautifulSoup is a Python library for parsing HTML and XML documents. It creates parse trees from the documents that can be used to extract data from HTML and XML files with a simple and intuitive API. BeautifulSoup is commonly used for web scraping and data extraction. 
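
Here is a minimal sketch of extracting data from a parse tree. The HTML snippet is made up and embedded in the script, and Python's built-in "html.parser" is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# A made-up HTML document.
html = """
<html><body>
  <h1>Blog</h1>
  <ul class="posts">
    <li><a href="/numpy">NumPy basics</a></li>
    <li><a href="/pandas">Pandas basics</a></li>
  </ul>
</body></html>
"""

# Build the parse tree.
soup = BeautifulSoup(html, "html.parser")

# Navigate to the list and collect the text and target of every link in it.
post_list = soup.find("ul", class_="posts")
titles = [a.get_text(strip=True) for a in post_list.find_all("a")]
links = [a["href"] for a in post_list.find_all("a")]

print(soup.h1.get_text())  # Blog
print(titles)              # ['NumPy basics', 'Pandas basics']
print(links)               # ['/numpy', '/pandas']
```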

Wrapping up 

In conclusion, these Python packages are some of the most popular and widely used libraries in the Python data science ecosystem. They provide powerful and flexible tools for data manipulation, analysis, and visualization, and are essential for aspiring and practicing data scientists. With their help, data scientists can perform complex data analysis and machine learning tasks and create informative visualizations. 

If you want to learn more about data science and how to use these Python packages, we recommend checking out Data Science Dojo’s Python for Data Science course, which provides a comprehensive introduction to Python and its data science ecosystem. 

 

Ali Haider Shalwani
| April 27

This blog lists trending data science, analytics, and engineering GitHub repositories that can help you learn data science and build your own portfolio.  

What is GitHub?

GitHub is a powerful platform for data scientists, data analysts, data engineers, Python and R developers, and more. It is an excellent resource for beginners who are just starting with data science, analytics, and engineering. There are thousands of open-source repositories available on GitHub that provide code examples, datasets, and tutorials to help you get started with your projects.  

This blog lists some useful GitHub repositories that will not only help you learn new concepts but also save you time by providing pre-built code and tools that you can customize to fit your needs. 


Best GitHub repositories to stay ahead of the tech curve

With GitHub, you can easily collaborate with others, share your code, and build a portfolio of projects that showcase your skills.  

Trending GitHub Repositories
  1. Scikit-learn: A Python library for machine learning built on top of NumPy, SciPy, and matplotlib. It provides a range of algorithms for classification, regression, clustering, and more.  

Link to the repository: https://github.com/scikit-learn/scikit-learn 

  2. TensorFlow: An open-source machine learning library developed by the Google Brain team. TensorFlow is used for numerical computation using data flow graphs.  

Link to the repository: https://github.com/tensorflow/tensorflow 

  3. Keras: A deep learning library for Python that provides a user-friendly interface for building neural networks. Early versions could run on top of TensorFlow, Theano, or CNTK; Keras 3 supports TensorFlow, JAX, and PyTorch as backends.  

Link to the repository: https://github.com/keras-team/keras 

  4. Pandas: A Python library for data manipulation and analysis. It provides a range of data structures for efficient data handling and analysis.  

Link to the repository: https://github.com/pandas-dev/pandas 


  5. PyTorch: An open-source machine learning library developed by Facebook’s AI research group. PyTorch provides tensor computation and deep neural networks with GPU acceleration.  

Link to the repository: https://github.com/pytorch/pytorch 

  6. Apache Spark: An open-source distributed computing system used for big data processing. It can be used with a range of programming languages such as Python, R, and Java.  

Link to the repository: https://github.com/apache/spark 

  7. FastAPI: A modern web framework for building APIs with Python. It is designed for high performance, asynchronous programming, and easy integration with other libraries.  

Link to the repository: https://github.com/tiangolo/fastapi 

  8. Dask: A flexible parallel computing library for analytic computing in Python. It provides dynamic task scheduling and efficient memory management.  

Link to the repository: https://github.com/dask/dask 

  9. Matplotlib: A Python plotting library that provides a range of 2D plotting features. It can be used for creating interactive visualizations, animations, and more.  

Link to the repository: https://github.com/matplotlib/matplotlib

 



  10. Seaborn: A Python data visualization library based on matplotlib. It provides a range of statistical graphics and visualization tools.  

Link to the repository: https://github.com/mwaskom/seaborn

  11. NumPy: A Python library for numerical computing that provides a range of array and matrix operations. It is used extensively in scientific computing and data analysis.  

Link to the repository: https://github.com/numpy/numpy 

  12. Tidyverse: A collection of R packages for data manipulation, visualization, and analysis. It includes popular packages such as ggplot2, dplyr, and tidyr. 

Link to the repository: https://github.com/tidyverse/tidyverse 

In a nutshell

In conclusion, GitHub is a valuable resource for developers, data scientists, and engineers who are looking to stay ahead of the technology curve. With the vast number of repositories available, it can be overwhelming to find the ones that are most useful and relevant to your interests. The repositories we have highlighted in this blog cover a range of topics, from machine learning and deep learning to data visualization and programming languages. By exploring these repositories, you can gain new skills, learn best practices, and stay up-to-date with the latest developments in the field.

Do you happen to have any others in mind? Please feel free to share them in the comments section below!  

 

Data Science Dojo
Stephanie Kirmer
| March 3

Data science model deployment can sound intimidating if you have never had a chance to try it in a safe space. Do you want to make a REST API or a full frontend app? What does it take to do either of these? It’s not as hard as you might think. 

In this series, we’ll go through how you can take machine learning models and deploy them to a web app or a REST API (using Saturn Cloud) so that others can interact with them. In this app, we’ll let the user make some feature selections, and then the model will predict an outcome for them. Using the same idea, you could easily do other things, such as letting the user retrain the model, upload things like images, or interact with your model in other ways. 

To make things interesting, we’re going to do this same project with two frameworks, Voila and Flask, so you can see how they both work and decide what’s right for your needs. In Flask, we’ll create both a REST API and a web app version.

Learn data science with Data Science Dojo and Saturn Cloud
Our toolkit
 

Other helpful links 

The project – Deploying machine learning models

The first steps of our process are exactly the same, whether we are going for Voila or Flask. We need to get some data and build a model! I will take the US Department of Education’s College Scorecard data and build a quick linear regression model that accepts a few inputs and predicts a student’s likely earnings two years after graduation. (You can get this data yourself at https://collegescorecard.ed.gov/data/) 

About measurements 

According to the data codebook: “the cohort of evaluated graduates for earnings metrics consists of those individuals who received federal financial aid, but excludes those who were subsequently enrolled in school during the measurement year, died before the end of the measurement year, received a higher-level credential than the credential level of the field of the study measured, or did not work during the measurement year.” 

Load data 

I already did some data cleaning and uploaded the features I wanted to a public bucket on S3 for easy access. This way, I can load the data quickly when the app runs. 
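
The loading code itself is not reproduced here, so the following is only a minimal sketch of how such a step might look with pandas. The function name load_data is hypothetical, the column names are assumed to be lowercase versions of the feature names listed in this post, and the three rows are made-up values standing in for the real file. In the app, the same function would be pointed at the URL of the CSV in the public bucket, which is not given in this post.

```python
import io

import pandas as pd

# Made-up sample standing in for the cleaned College Scorecard extract.
# Column names are an assumption based on the feature list in this post.
SAMPLE_CSV = """region,locale,control,cipdesc_new,creddesc,adm_rate_all,sat_avg_all,tuition,earn_mdn_hi_2yr
West,City,Public,Biology,Bachelor,0.55,1210,12000,41000
South,Town,Private,History,Master,0.70,1105,30000,52000
Midwest,City,Public,Nursing,Bachelor,0.62,1150,15000,58000
"""


def load_data(source):
    """Read the cleaned dataset from a path, URL, or file-like object (hypothetical helper)."""
    return pd.read_csv(source)


# Here the source is an in-memory file; pd.read_csv also accepts a URL string.
df = load_data(io.StringIO(SAMPLE_CSV))
print(df.shape)  # (3, 9)
```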

Format for training 

Once we have the dataset, it gives us a handful of features and our outcome. We just need to split it into features and target with scikit-learn to be ready to model. (Note that all of these functions will be run exactly as written in each of our apps.) 

 Our features are: 

  • Region: geographic location of college 
  • Locale: type of city or town the college is in 
  • Control: type of college (public/private/for-profit) 
  • Cipdesc_new: major field of study (CIP code) 
  • Creddesc: credential (bachelor’s, master’s, etc.) 
  • Adm_rate_all: admission rate 
  • Sat_avg_all: average SAT score for admitted students (proxy for college prestige) 
  • Tuition: cost to attend the institution for one year 


Our target outcome is earn_mdn_hi_2yr: median earnings measured two years after completion of degree.
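
The split into features and target described above can be sketched as follows. This is not the original function: the name split_data is hypothetical, the column names are assumed to be lowercase versions of the feature list, and the eight rows of data are made up so that the example runs on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed column names, following the feature list in this post.
FEATURES = [
    "region", "locale", "control", "cipdesc_new",
    "creddesc", "adm_rate_all", "sat_avg_all", "tuition",
]
TARGET = "earn_mdn_hi_2yr"


def split_data(df, test_size=0.25, random_state=0):
    """Separate features from the target and hold out a test sample (hypothetical helper)."""
    X = df[FEATURES]
    y = df[TARGET]
    return train_test_split(X, y, test_size=test_size, random_state=random_state)


# Made-up rows standing in for the real dataset.
df = pd.DataFrame({
    "region": ["West", "South"] * 4,
    "locale": ["City", "Town"] * 4,
    "control": ["Public", "Private"] * 4,
    "cipdesc_new": ["Biology", "History"] * 4,
    "creddesc": ["Bachelor", "Master"] * 4,
    "adm_rate_all": [0.5, 0.7] * 4,
    "sat_avg_all": [1200, 1100] * 4,
    "tuition": [12000, 30000] * 4,
    "earn_mdn_hi_2yr": [40000, 52000] * 4,
})

X_train, X_test, y_train, y_test = split_data(df)
print(len(X_train), len(X_test))  # 6 2
```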
 

Train model 

We are going to use scikit-learn’s Pipeline to make our feature engineering as easy and quick as possible. The training function returns the trained model together with the R-squared value on the test sample, which gives us a quick and straightforward measure of the model’s performance. 
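
A minimal sketch of such a training function is shown below. It is an illustration under stated assumptions rather than the original code: the function name train_model, the choice of one-hot encoding for the categorical columns and scaling for the numeric ones, and the lowercase column names are all assumptions, and the data is synthetic so the example is self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column names, following the feature list in this post.
CATEGORICAL = ["region", "locale", "control", "cipdesc_new", "creddesc"]
NUMERIC = ["adm_rate_all", "sat_avg_all", "tuition"]
TARGET = "earn_mdn_hi_2yr"


def train_model(df, random_state=0):
    """Fit a preprocessing + linear regression pipeline; return it with its test R-squared."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[CATEGORICAL + NUMERIC], df[TARGET], test_size=0.25, random_state=random_state
    )
    preprocess = ColumnTransformer([
        # One-hot encode categories; ignore categories unseen during training.
        ("categorical", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
        # Put numeric columns on a comparable scale.
        ("numeric", StandardScaler(), NUMERIC),
    ])
    model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)


# Synthetic stand-in data: earnings depend on SAT average and tuition plus noise.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "region": rng.choice(["West", "South", "Midwest"], n),
    "locale": rng.choice(["City", "Town"], n),
    "control": rng.choice(["Public", "Private"], n),
    "cipdesc_new": rng.choice(["Biology", "History"], n),
    "creddesc": rng.choice(["Bachelor", "Master"], n),
    "adm_rate_all": rng.uniform(0.1, 0.9, n),
    "sat_avg_all": rng.integers(900, 1500, n),
    "tuition": rng.integers(5000, 50000, n),
})
df[TARGET] = 20000 + 20 * df["sat_avg_all"] + 0.2 * df["tuition"] + rng.normal(0, 2000, n)

model, r2 = train_model(df)
print(f"test R^2: {r2:.2f}")
```

Keeping the preprocessing inside the pipeline means the app can pass raw user selections straight to model.predict without repeating the encoding steps.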

Now we have a model, and we’re ready to put together the app! All these functions will be run when the app runs, because it’s so fast that it doesn’t make sense to save out a model object to be loaded. If your model doesn’t train this fast, save your model object and return it in your app when you need to predict. 

If you’re interested in learning some valuable tips for machine learning projects, read our blog on machine learning project tips.

Visualization 

In addition to building a model and creating predictions, we want our app to show a visual of the prediction against a relevant distribution. The same plot function can be used for both apps because we are using Plotly for the job. 

The plotting function accepts the type of degree and the major, which it uses to generate the distribution, as well as the prediction that the model has given. That way, the viewer can see how their prediction compares to others. Later, we’ll see how the different app frameworks use the Plotly object. 

 

 This is the general visual we’ll be generating, but because it’s Plotly, it’ll be interactive! 

Deploying machine learning models

You might be wondering whether your favorite visualization library could work here. The answer is: maybe! Every Python visualization library has idiosyncrasies and is not likely to be supported in exactly the same way by Voila and Flask. I chose Plotly because it has interactivity and is fully functional in both frameworks, but you are welcome to try your own visualization tool and see how it goes.  

Wrapping up

In conclusion, deploying machine learning models to a web app or REST API can seem daunting, but it’s not as difficult as it may seem. By using frameworks like Voila and Flask, along with libraries like scikit-learn, Plotly, and pandas, you can create an app that allows users to interact with machine learning models. In this project, we used the US Department of Education’s College Scorecard data to build a linear regression model that predicts a student’s likely earnings two years after graduation.

 
