fbpx
until LLM Bootcamp: In-Person (Seattle) and Online Learn more

beautifulsoup

Python is a powerful and versatile programming language that has become increasingly popular in the field of data science. One of the main reasons for its popularity is the vast array of libraries and packages available for data manipulation, analysis, and visualization.

10 Python packages for data science and machine learning

In this article, we will highlight some of the top Python packages for data science that aspiring and practicing data scientists should consider adding to their toolbox. 

1. NumPy 

NumPy is a fundamental package for scientific computing in Python. It supports large, multi-dimensional arrays and matrices of numerical data, as well as a large library of mathematical functions to operate on these arrays. The package is particularly useful for performing mathematical operations on large datasets and is widely used in machine learning, data analysis, and scientific computing. 

2. Pandas 

Pandas is a powerful data manipulation library for Python that provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data easy and intuitive. The package is particularly well-suited for working with tabular data, such as spreadsheets or SQL tables, and provides powerful data cleaning, transformation, and wrangling capabilities. 

3. Matplotlib 

Matplotlib is a plotting library for Python that provides an extensive API for creating static, animated, and interactive visualizations. The library is highly customizable, and users can create a wide range of plots, including line plots, scatter plots, bar plots, histograms, and heat maps. Matplotlib is a great tool for data visualization and is widely used in data analysis, scientific computing, and machine learning. 

4. Seaborn 

Seaborn is a library for creating attractive and informative statistical graphics in Python. The library is built on top of Matplotlib and provides a high-level interface for creating complex visualizations, such as heat maps, violin plots, and scatter plots. Seaborn is particularly well-suited for visualizing complex datasets and is often used in data exploration and analysis. 

5. Scikit-learn 

Scikit-learn is a powerful library for machine learning in Python. It provides a wide range of tools for supervised and unsupervised learning, including linear regression, k-means clustering, and support vector machines. The library is built on top of NumPy and Pandas and is designed to be easy to use and highly extensible. Scikit-learn is a go-to tool for data scientists and machine learning practitioners. 

6. TensorFlow 

TensorFlow is an open-source software library for dataflow and differentiable programming across various tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks. TensorFlow was developed by the Google Brain team and is used in many of Google’s products and services. 

7. SQLAlchemy

SQLAlchemy is a Python package that serves as both a SQL toolkit and an Object-Relational Mapping (ORM) library. It is designed to simplify the process of working with databases by providing a consistent and high-level interface. It offers a set of utilities and abstractions that make it easier to interact with relational databases using SQL queries. It provides a flexible and expressive syntax for constructing SQL statements, allowing you to perform various database operations such as querying, inserting, updating, and deleting data.

8. OpenCV

OpenCV (CV2) is a library of programming functions mainly aimed at real-time computer vision. Originally developed by Intel, it was later supported by Willow Garage and is now maintained by Itseez. OpenCV is available for C++, Python, and Java. 

9. urllib 

urllib is a module in the Python standard library that provides a set of simple, high-level functions for working with URLs and web protocols. It includes functions for opening and closing network connections, sending and receiving data, and parsing URLs. 

10. BeautifulSoup 

BeautifulSoup is a Python library for parsing HTML and XML documents. It creates parse trees from the documents that can be used to extract data from HTML and XML files with a simple and intuitive API. BeautifulSoup is commonly used for web scraping and data extraction. 

Wrapping up 

In conclusion, these Python packages are some of the most popular and widely-used libraries in the Python data science ecosystem. They provide powerful and flexible tools for data manipulation, analysis, and visualization, and are essential for aspiring and practicing data scientists. With the help of these Python packages, data scientists can easily perform complex data analysis and machine learning tasks, and create beautiful and informative visualizations. 

If you want to learn more about data science and how to use these Python packages, we recommend checking out Data Science Dojo’s Python for Data Science course, which provides a comprehensive introduction to Python and its data science ecosystem. 

 

May 1, 2023

Use Python and BeautifulSoup to web scrape. Web scraping is a very powerful tool to learn for any data professional. Make the entire internet your database.

Web scraping tutorial using Python and BeautifulSoup

With web scraping, the entire internet becomes your database. In this tutorial, we show you how to parse a web page into a data file (csv) using a Python package called BeautifulSoup.

web scraping

There are many services out there that augment their business data or even build out their entire business by using web scraping. For example there is a steam sales website that tracks and ranks steam sales, updated hourly. Companies can also scrape product reviews from places like Amazon to stay up-to-date with what customers are saying about their products.


The code

from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client
#URl to web scrap from.
#in this example we web scrap graphics cards from Newegg.com
page_url = "http://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"
#opens the connection and downloads html page from url
uClient = uReq(page_url)
#parses html into a soup data structure to traverse html
#as if it were a json data type.
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
#finds each product from the store page
containers = page_soup.findAll("div", {"class": "item-container"})
#name the output file to write to local disk
out_filename = "graphics_cards.csv"
#header of csv file to be written
headers = "brand,product_name,shippingn"
#opens file, and writes headers
f = open(out_filename, "w")
f.write(headers)
#loops over each product and grabs attributes about
#each product
for container in containers:
# Finds all link tags "a" from within the first div.
make_rating_sp = container.div.select("a")
# Grabs the title from the image title attribute
# Then does proper casing using .title()
brand = make_rating_sp[0].img["title"].title()
# Grabs the text within the second "(a)" tag from within
# the list of queries.
product_name = container.div.select("a")[2].text
# Grabs the product shipping information by searching
# all lists with the class "price-ship".
# Then cleans the text of white space with strip()
# Cleans the strip of "Shipping $" if it exists to just get number
shipping = container.findAll("li", {"class": "price-ship"})[0].text.strip().replace("$", "").replace(" Shipping", "")
# prints the dataset to console
print("brand: " + brand + "n")
print("product_name: " + product_name + "n")
print("shipping: " + shipping + "n")
# writes the dataset to file
f.write(brand + ", " + product_name.replace(",", "|") + ", " + shipping + "n")
f.close()  # Close the file

The video (enjoy!)

For more info, there’s a script that does the same thing in R

Want to learn more data science techniques in Python? Take a look at this introduction to Python for Data Science

August 18, 2022

Related Topics

Statistics
Resources
Programming
Machine Learning
LLM
Generative AI
Data Visualization
Data Security
Data Science
Data Engineering
Data Analytics
Computer Vision
Career
Artificial Intelligence