fbpx
Learn to build large language model applications: vector databases, langchain, fine tuning and prompt engineering. Learn more

Python

Ali Haider - Author
Ali Haider Shalwani
| November 10

Generative AI is a branch of artificial intelligence that focuses on the creation of new content, such as text, images, music, and code. This is done by training machine learning models on large datasets of existing content, which the model then uses to generate new and original content. 

 

Want to build a custom large language model? Check out our in-person LLM bootcamp. 


Popular Python libraries for Generative AI

 

Python libraries for generative AI  | Data Science Dojo
Python libraries for generative AI

 

Python is a popular programming language for generative AI, as it has a wide range of libraries and frameworks available. Here are 10 of the top Python libraries for generative AI: 

 1. TensorFlow:

TensorFlow is a popular open-source machine learning library that can be used for a variety of tasks, including generative AI. TensorFlow provides a wide range of tools and resources for building and training generative models, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs).

TensorFlow can be used to train and deploy a variety of generative models, including: 

  • Generative adversarial networks (GANs) 
  • Variational autoencoders (VAEs) 
  • Transformer-based text generation models 
  • Diffusion models 

TensorFlow is a good choice for generative AI because it is flexible and powerful, and it has a large community of users and contributors. 

 

2. PyTorch:

PyTorch is another popular open-source machine learning library that is well-suited for generative AI. PyTorch is known for its flexibility and ease of use, making it a good choice for beginners and experienced users alike. 

PyTorch can be used to train and deploy a variety of generative models, including: 

  • Conditional GANs 
  • Autoregressive models 
  • Diffusion models 

PyTorch is a good choice for generative AI because it is easy to use and has a large community of users and contributors. 

 

Large language model bootcamp

 

3. Transformers:

Transformers is a Python library that provides a unified API for training and deploying transformer models. Transformers are a type of neural network architecture that is particularly well-suited for natural language processing tasks, such as text generation and translation.

Transformers can be used to train and deploy a variety of generative models, including: 

  • Transformer-based text generation models, such as GPT-3 and LaMDA 

Transformers is a good choice for generative AI because it is easy to use and provides a unified API for training and deploying transformer models. 

 

4. Diffusers:

Diffusers is a Python library for diffusion models, which are a type of generative model that can be used to generate images, audio, and other types of data. Diffusers provides a variety of pre-trained diffusion models and tools for training and fine-tuning your own models.

Diffusers can be used to train and deploy a variety of generative models, including: 

  • Diffusion models for image generation 
  • Diffusion models for audio generation 
  • Diffusion models for other types of data generation 

 

Diffusers is a good choice for generative AI because it is easy to use and provides a variety of pre-trained diffusion models. 

 

 

5. Jax:

Jax is a high-performance numerical computation library for Python with a focus on machine learning and deep learning research. It is developed by Google AI and has been used to achieve state-of-the-art results in a variety of machine learning tasks, including generative AI. Jax has a number of advantages for generative AI, including:

  • Performance: Jax is highly optimized for performance, making it ideal for training large and complex generative models. 
  • Flexibility: Jax is a general-purpose numerical computing library, which gives it a great deal of flexibility for implementing different types of generative models. 
  • Ecosystem: Jax has a growing ecosystem of tools and libraries for machine learning and deep learning, which can be useful for developing and deploying generative AI applications. 

Here are some examples of how Jax can be used for generative AI: 

  • Training generative adversarial networks (GANs) 
  • Training diffusion models 
  • Training transformer-based text generation models 
  • Training other types of generative models, such as variational autoencoders (VAEs) and reinforcement learning-based generative models 

 

Get started with Python, checkout our instructor-led live Python for Data Science training.  

 

6. LangChain: 

LangChain is a Python library for chaining multiple generative models together. This can be useful for creating more complex and sophisticated generative applications, such as text-to-image generation or image-to-text generation.

Overview of LangChain Modules
Overview of LangChain Modules

LangChain is a good choice for generative AI because it makes it easy to chain multiple generative models together to create more complex and sophisticated applications.  

 

7. LlamaIndex:

LlamaIndex is a Python library for ingesting and managing private data for machine learning models. LlamaIndex can be used to store and manage your training datasets and trained models in a secure and efficient way.

 

LlamaIndex is a good choice for generative AI because it makes it easy to store and manage your training datasets and trained models in a secure and efficient way. 

 

8. Weight and biases:

Weight and Biases (W&B) is a platform that helps machine learning teams track, monitor, and analyze their experiments. W&B provides a variety of tools and resources for tracking and monitoring your generative AI experiments, such as:

  • Experiment tracking: W&B makes it easy to track your experiments and see how your models are performing over time. 
  • Model monitoring: W&B monitors your models in production and alerts you to any problems. 
  • Experiment analysis: W&B provides a variety of tools for analyzing your experiments and identifying areas for improvement. 


Learn to build LLM applications

 

9. Acme:

Acme is a reinforcement learning library for TensorFlow. Acme can be used to train and deploy reinforcement learning-based generative models, such as GANs and policy gradients.

Acme provides a variety of tools and resources for training and deploying reinforcement learning-based generative models, such as: 

  • Reinforcement learning algorithms: Acme provides a variety of reinforcement learning algorithms, such as Q-learning, policy gradients, and actor-critic. 
  • Environments: Acme provides a variety of environments for training and deploying reinforcement learning-based generative models. 
  • Model deployment: Acme provides tools for deploying reinforcement learning-based generative models to production. 

 

 Python libraries help in building generative AI applications

These libraries can be used to build a wide variety of generative AI applications, such as:

  • Chatbots: Chatbots can be used to provide customer support, answer questions, and engage in conversations with users.
  • Content generation: Generative AI can be used to generate different types of content, such as blog posts, articles, and even books.
  • Code generation: Generative AI can be used to generate code, such as Python, Java, and C++.
  • Image generation: Generative AI can be used to generate images, such as realistic photos and creative artwork.

Generative AI is a rapidly evolving field, and new Python libraries are being developed all the time. The libraries listed above are just a few of the most popular and well-established options.

Ruhma Khawaja author
Ruhma Khawaja
| June 9

The job market for data scientists is booming. In fact, the demand for data experts is expected to grow by 36% between 2021 and 2031, significantly higher than the average for all occupations. This is great news for anyone who is interested in a career in data science.

According to the U.S. Bureau of Labor Statistics, the job outlook for data science is estimated to be 36% between 2021–31, significantly higher than the average for all occupations, which is 5%. This makes it an opportune time to pursue a career in data science. 

Data Science Bootcamp
Data Science Bootcamp

What are Data Science Bootcamps? 

Data science boot camps are intensive, short-term programs that teach students the skills they need to become data scientists. These programs typically cover topics such as data wrangling, statistical inference, machine learning, and Python programming. 

  • Short-term: Bootcamps typically last for 3-6 months, which is much shorter than traditional college degrees. 
  • Flexible: Bootcamps can be completed online or in person, and they often offer part-time and full-time options. 
  • Practical experience: Bootcamps typically include a capstone project, which gives students the opportunity to apply the skills they have learned. 
  • Industry-focused: Bootcamps are taught by industry experts, and they often have partnerships with companies that are hiring data scientists. 


Top 10 Data Science Bootcamps

Without further ado, here is our selection of the most reputable data science boot camps.  

1. Data Science Dojo Data Science Bootcamp

  • Delivery Format: Online and In-person
  • Tuition: $2,659 to $4,500
  • Duration: 16 weeks
Data Science Dojo Bootcamp
Data Science Dojo Bootcamp

Data Science Dojo Bootcamp is an excellent choice for aspiring data scientists. With 1:1 mentorship and live instructor-led sessions, it offers a supportive learning environment. The program is beginner-friendly, requiring no prior experience. Easy installments with 0% interest options make it the top affordable choice. Rated as an impressive 4.96, Data Science Dojo Bootcamp stands out among its peers. Students learn key data science topics, work on real-world projects, and connect with potential employers. Moreover, it prioritizes a business-first approach that combines theoretical knowledge with practical, hands-on projects. With a team of instructors who possess extensive industry experience, students have the opportunity to receive personalized support during dedicated office hours.

2. Springboard Data Science Bootcamp

  • Delivery Format: Online
  • Tuition: $14,950
  • Duration: 12 months long
Springboard Data Science Bootcamp
Springboard Data Science Bootcamp

Springboard’s Data Science Bootcamp is a great option for students who want to learn data science skills and land a job in the field. The program is offered online, so students can learn at their own pace and from anywhere in the world. The tuition is high, but Springboard offers a job guarantee, which means that if you don’t land a job in data science within six months of completing the program, you’ll get your money back.

3. Flatiron School Data Science Bootcamp

  • Delivery Format: Online or On-campus (currently online only)
  • Tuition: $15,950 (full-time) or $19,950 (flexible)
  • Duration: 15 weeks long
Flatiron School Data Science Bootcamp
Flatiron School Data Science Bootcamp

Next on the list, we have Flatiron School’s Data Science Bootcamp. The program is 15 weeks long for the full-time program and can take anywhere from 20 to 60 weeks to complete for the flexible program.
Students have access to a variety of resources, including online forums, a community, and one-on-one mentorship.

4. Coding Dojo Data Science Bootcamp Online Part-Time

  • Delivery Format: Online
  • Tuition: $11,745 to $13,745
  • Duration: 16 to 20 weeks
Coding Dojo Data Science Bootcamp Online Part-Time
Coding Dojo Data Science Bootcamp Online Part-Time

Coding Dojo’s online bootcamp is open to students with any background and does not require a four-year degree or Python programming experience. Students can choose to focus on either data science and machine learning in Python or data science and visualization. It offers flexible learning options, real-world projects, and a strong alumni network. However, it does not guarantee a job, requires some prior knowledge, and is time-consuming.

5. CodingNomads Data Science and Machine Learning Course

  • Delivery Format: Online
  • Tuition: Membership: $9/month, Premium Membership: $29/month, Mentorship: $899/month
  • Duration: Self-paced
CodingNomads Data Science Course
CodingNomads Data Science Course

CodingNomads offers a data science and machine learning course that is affordable, flexible, and comprehensive. The course is available in three different formats: membership, premium membership, and mentorship. The membership format is self-paced and allows students to work through the modules at their own pace. The premium membership format includes access to live Q&A sessions. The mentorship format includes one-on-one instruction from an experienced data scientist. CodingNomads also offers scholarships to local residents and military students.

6. Udacity School of Data Science

  • Delivery Format: Online
  • Tuition: $399/month
  • Duration: Depends on the program
Udacity School of Data Science
Udacity School of Data Science

Udacity offers multiple data science bootcamps, including data science for business leaders, data project managers and more. It offers frequent start dates throughout the year for its data science programs. These programs are self-paced and involve real-world projects and technical mentor support. Students can also receive LinkedIn profile and GitHub portfolio reviews from Udacity’s career services. However, it is important to note that there is no job guarantee, so students should be prepared to put in the work to find a job after completing the program.

7. LearningFuze Data Science Bootcamp

  • Delivery Format: Online and in person
  • Tuition: $5,995 per module
  • Duration: Multiple formats
LearningFuze Data Science Bootcamp
LearningFuze Data Science Bootcamp

LearningFuze offers a data science boot camp through a strategic partnership with Concordia University Irvine. Offering students the choice of live online or in-person instruction, the program gives students ample opportunities to interact one-on-one with their instructors. LearningFuze also offers partial tuition refunds to students who are unable to find a job within six months of graduation.

The program’s curriculum includes modules in machine learning and deep learning and artificial intelligence. However, it is essential to note that there are no scholarships available, and the program does not accept the GI Bill.

8. Thinkful Data Science Bootcamp

  • Delivery Format: Online
  • Tuition: $16,950
  • Duration: 6 months
Thinkful Data Science Bootcamp
Thinkful Data Science Bootcamp

Thinkful offers a data science boot camp which is best known for its mentorship program. It caters to both part-time and full-time students. Part-time offers flexibility with 20-30 hours per week, taking 6 months to finish. Full-time is accelerated at 50 hours per week, completing in 5 months. Payment plans, tuition refunds, and scholarships are available for all students. The program has no prerequisites, so both fresh graduates and experienced professionals can take this program.

9. Brain Station Data Science Course Online

  • Delivery Format: Online
  • Tuition: $9,500 (part time); $16,000 (full time)
  • Duration: 10 weeks
Brain Station Data Science Course Online
Brain Station Data Science Course Online

BrainStation offers an immersive and hands-on data science boot camp that is both comprehensive and affordable. Industry experts teach the program and includes real-world projects and assignments. BrainStation has a strong job placement rate, with over 90% of graduates finding jobs within six months of completing the program. However, the program is expensive and can be demanding. Students should carefully consider their financial situation and time commitment before enrolling in the program.

10. BloomTech Data Science Bootcamp

  • Delivery Format: Online
  • Tuition: $19,950
  • Duration: 6 months
BloomTech Data Science Bootcamp
BloomTech Data Science Bootcamp

BloomTech offers a data science bootcamp covers a wide range of topics, including statistics, predictive modeling, data engineering, machine learning, and Python programming. BloomTech also offers a 4-week fellowship at a real company, which gives students the opportunity to gain work experience. BloomTech has a strong job placement rate, with over 90% of graduates finding jobs within six months of completing the program. The program is expensive and requires a significant time commitment, but it is also very rewarding.

What to expect in a data science bootcamp?

A data science bootcamp is a short-term, intensive program that teaches you the fundamentals of data science. While the curriculum may be comprehensive, it cannot cover the entire field of data science.

Therefore, it is important to have realistic expectations about what you can learn in a bootcamp. Here are some of the things you can expect to learn in a data science bootcamp:

  • Data science concepts: This includes topics such as statistics, machine learning, and data visualization.
  • Hands-on projects: You will have the opportunity to work on real-world data science projects. This will give you the chance to apply what you have learned in the classroom.
  • A portfolio: You will build a portfolio of your work, which you can use to demonstrate your skills to potential employers.
  • Mentorship: You will have access to mentors who can help you with your studies and career development.
  • Career services: Bootcamps typically offer career services, such as resume writing assistance and interview preparation.

Wrapping up

All and all, data science bootcamps can be a great way to learn the fundamentals of data science and gain the skills you need to launch a career in this field. If you are considering a boot camp, be sure to do your research and choose a program that is right for you.

Data Science Dojo
Nimrah Sohail
| June 2

Postman is a popular collaboration platform for API development used by developers all over the world. It is a powerful tool that simplifies the process of testing, documenting, and sharing APIs.

Postman provides a user-friendly interface that enables developers to interact with RESTful APIs and streamline their API development workflow. In this blog post, we will discuss the different HTTP methods, and how they can be used with Postman.

Postman and Python
Postman and Python

HTTP Methods

HTTP methods are used to specify the type of action that needs to be performed on a resource. There are several HTTP methods available, including GET, POST, PUT, DELETE, and PATCH. Each method has a specific purpose and is used in different scenarios:

  • GET is used to retrieve data from an API.
  • POST is used to create new data in an API.
  • PUT is used to update existing data in an API.
  • DELETE is used to delete data from an API.
  • PATCH is used to partially update existing data in an API.

1. GET Method

The GET method is used to retrieve information from the server. It is the most used HTTP method and is used to retrieve data from a server.   

In Postman, you can use the GET method to retrieve data from an API endpoint. To use the GET method, you need to specify the URL in the request bar and click on the Send button. Here are step-by-step instructions for making requests using GET: 

 In this tutorial, we are using the following URL:

Step 1:  

Create a new request by clicking + in the workbench to open a new tab.  

Step 2: 

Enter the URL of the API that we want to test. 

Step 3: 

Select the “GET” method. 

Get Method Step 3
Get Method Step 3

Click the “Send” button. 

2. POST Method

The POST method is used to send data to the server. It is commonly used to create new resources on the server. In Postman, you can use the POST method to send data to the server. To use the POST method, you need to specify the URL in the request. Here are step-by-step instructions for making requests using POST

  1. Create a new request.
  2. Enter the URL of the API that you want to test.
  3. Select the “POST” method.
  4. Add any additional headers or parameters to the request.
  5. Click the “Send” button.

3. PUT Method

PUT is used to update existing data in an API. In Postman, you can use the PUT method to update existing data in an API by selecting the “PUT” method from the drop-down menu next to the “Method” field.

You can also add data to the request body by clicking the “Body” tab and selecting the “raw” radio button. Here are step-by-step instructions for making requests using PUT

  1. Create a new request.
  2. Enter the URL of the API that you want to test.
  3. Select the “PUT” method.
  4. Add any additional headers or parameters to the request.
  5. Click the “Send” button.

4. DELETE Method

DELETE is used to delete existing data in an API. In Postman, you can use the DELETE method to delete existing data in an API by selecting the “DELETE” method from the drop-down menu next to the “Method” field. Here are step-by-step instructions for making requests using DELETE

  1. Create a new request.
  2. Enter the URL of the API that you want to test.
  3. Select the “DELETE” method.
  4. Add any additional headers or parameters to the request.
  5. Click the “Send” button.

5. PATCH Method

PATCH is used to partially update existing data in an API. In Postman, you can use the PATCH method to partially update existing data in an API by selecting the “PATCH” method from the drop-down menu next to the “Method” field.

You can also add data to the request body by clicking the “Body” tab and selecting the “raw” radio button. Here are step-by-step instructions for making requests using PATCH:

  1. Create a new request.
  2. Enter the URL of the API that you want to test.
  3. Select the “PATCH” method.
  4. Add any additional headers or parameters to the request.
  5. Click the “Send” button.

Why Postman and Python are useful together

With the Postman Python library, developers can create and send requests, manage collections and environments, and run tests. The library also provides a command-line interface (CLI) for interacting with Postman APIs from the terminal. 

How does Postman work with REST APIs? 

  • Creating Requests: Developers can use Postman to create HTTP requests for REST APIs. They can specify the request method, API endpoint, headers, and data. 
  • Sending Requests: Once the request is created, developers can send it to the API server. Postman provides tools for sending requests, such as the “Send” button, keyboard shortcuts, and history tracking. 
  • Testing Responses: Postman receives responses from the API server and displays them in the tool’s interface. Developers can test the response status, headers, and body. 
  • Debugging: Postman provides tools for debugging REST APIs, such as console logs and response time tracking. Developers can easily identify and fix issues with their APIs. 
  • Automation: Postman allows developers to automate testing, documentation, and other tasks related to REST APIs. Developers can write test scripts using JavaScript and run them using Postman’s test runner. 
  • Collaboration: Postman allows developers to share API collections with team members, collaborate on API development, and manage API documentation. Developers can also use Postman’s version control system to manage changes to their APIs.

Wrapping up

In summary, Postman is a powerful tool for working with REST APIs. It provides a user-friendly interface for creating, testing, and documenting REST APIs, as well as tools for debugging and automation. Developers can use Postman to collaborate with team members and manage API collections or developers working with APIs. 

Author image - Ayesha
Ayesha Saleem
| May 9

If you’re interested in investing in the stock market, you know how important it is to have access to accurate and up-to-date market data. This data can help you make informed decisions about which stocks to buy or sell, when to do so, and at what price. However, retrieving and analyzing this data can be a complex and time-consuming process. That’s where Python comes in.

Python is a powerful programming language that offers a wide range of tools and libraries for retrieving, analyzing, and visualizing stock market data. In this blog, we’ll explore how to use Python to retrieve fundamental stock market data, such as earnings reports, financial statements, and other key metrics. We’ll also demonstrate how you can use this data to inform your investment strategies and make more informed decisions in the market.

So, whether you’re a seasoned investor or just starting out, read on to learn how Python can help you gain a competitive edge in the stock market.

Using Python to retrieve fundamental stock market data
Using Python to retrieve fundamental stock market data – Source: Freepik  

How to retrieve fundamental stock market data using Python?

Python can be used to retrieve a company’s financial statements and earnings reports by accessing fundamental data of the stock.  Here are some methods to achieve this: 

1. Using the yfinance library:

One can easily get, read, and interpret financial data using Python by using the yfinance library along with the Pandas library. With this, a user can extract various financial data, including the company’s balance sheet, income statement, and cash flow statement. Additionally, yfinance can be used to collect historical stock data for a specific time period. 

2. Using Alpha Vantage:

Alpha Vantage offers a free API for enterprise-grade financial market data, including company financial statements and earnings reports. A user can extract financial data using Python by accessing the Alpha Vantage API. 

3. Using the get_quote_table method:

The get_quote_table method can be used to extract the data found on the summary page of a stock. This method extracts financial data from the summary page of stock and returns it in the form of a dictionary. From this dictionary, a user can extract the P/E ratio of a company, which is an important financial metric. Additionally, the get_stats_valuation method can be used to extract the P/E ratio of a company.

Python libraries for stock data retrieval: Fundamental and price data

Python has numerous libraries that enable us to access fundamental and price data for stocks. To retrieve fundamental data such as a company’s financial statements and earnings reports, we can use APIs or web scraping techniques.  

On the other hand, to get price data, we can utilize APIs or packages that provide direct access to financial databases. Here are some resources that can help you get started with retrieving both types of data using Python for data science: 

Retrieving fundamental data using API calls in Python is a straightforward process. An API or Application Programming Interface is a server that allows users to retrieve and send data to it using code.  

When requesting data from an API, we need to make a request, which is most commonly done using the GET method. The two most common HTTP request methods for API calls are GET and POST. 

After establishing a healthy connection with the API, the next step is to pull the data from the API. This can be done using the requests.get() method to pull the data from the mentioned API. Once we have the data, we can parse it into a JSON format. 

Top Python libraries like pandas and alpha_vantage can be used to retrieve fundamental data. For example, with alpha_vantage, the fundamental data of almost any stock can be easily retrieved using the Financial Data API. The formatting process can be coded and applied to the dataset to be used in future data science projects. 

Obtaining essential stock market information through APIs

There are various financial data APIs available that can be used to retrieve fundamental data of a stock. Some popular APIs are eodhistoricaldata.com, Nasdaq Data Link APIs, and Morningstar. 

  • Eodhistoricaldata.com, also known as EOD HD, is a website that provides more than just fundamental data and is free to sign up for. It can be used to retrieve fundamental data of a stock.  
  • Nasdaq Data Link APIs can be used to retrieve historical time-series of a stock’s price in CSV format. It offers a simple call to retrieve the data. 
  • Morningstar can also be used to retrieve fundamental data of a stock. One can search for a stock on the website and click on the first result to access the stock’s page and retrieve its data. 
  • Another source for fundamental financial company data is a free source created by a friend. All of the data is easily available from the website, and they offer API access to global stock data (quotes and fundamentals). The documentation for the API access can be found on their website. 

Once you have established a connection to an API, you can pull the fundamental data of a stock using requests. The fundamental data can then be parsed into JSON format using Python libraries such as pandas and alpha_vantage. 

Conclusion 

In summary, retrieving fundamental data using API calls in Python is a simple process that involves establishing a healthy connection with the API, pulling the data from the API using requests.get(), and parsing it into a JSON format. Python libraries like pandas and alpha_vantage can be used to retrieve fundamental data. 

 

Author image - Ayesha
Ayesha Saleem
| May 1

Python is a powerful and versatile programming language that has become increasingly popular in the field of data science. One of the main reasons for its popularity is the vast array of libraries and packages available for data manipulation, analysis, and visualization.

10 Python packages for data science and machine learning

In this article, we will highlight some of the top Python packages for data science that aspiring and practicing data scientists should consider adding to their toolbox. 

1. NumPy 

NumPy is a fundamental package for scientific computing in Python. It supports large, multi-dimensional arrays and matrices of numerical data, as well as a large library of mathematical functions to operate on these arrays. The package is particularly useful for performing mathematical operations on large datasets and is widely used in machine learning, data analysis, and scientific computing. 

2. Pandas 

Pandas is a powerful data manipulation library for Python that provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data easy and intuitive. The package is particularly well-suited for working with tabular data, such as spreadsheets or SQL tables, and provides powerful data cleaning, transformation, and wrangling capabilities. 

3. Matplotlib 

Matplotlib is a plotting library for Python that provides an extensive API for creating static, animated, and interactive visualizations. The library is highly customizable, and users can create a wide range of plots, including line plots, scatter plots, bar plots, histograms, and heat maps. Matplotlib is a great tool for data visualization and is widely used in data analysis, scientific computing, and machine learning. 

4. Seaborn 

Seaborn is a library for creating attractive and informative statistical graphics in Python. The library is built on top of Matplotlib and provides a high-level interface for creating complex visualizations, such as heat maps, violin plots, and scatter plots. Seaborn is particularly well-suited for visualizing complex datasets and is often used in data exploration and analysis. 

5. Scikit-learn 

Scikit-learn is a powerful library for machine learning in Python. It provides a wide range of tools for supervised and unsupervised learning, including linear regression, k-means clustering, and support vector machines. The library is built on top of NumPy and Pandas and is designed to be easy to use and highly extensible. Scikit-learn is a go-to tool for data scientists and machine learning practitioners. 

6. TensorFlow 

TensorFlow is an open-source software library for dataflow and differentiable programming across various tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks. TensorFlow was developed by the Google Brain team and is used in many of Google’s products and services. 

7. SQLAlchemy

SQLAlchemy is a Python package that serves as both a SQL toolkit and an Object-Relational Mapping (ORM) library. It is designed to simplify the process of working with databases by providing a consistent and high-level interface. It offers a set of utilities and abstractions that make it easier to interact with relational databases using SQL queries. It provides a flexible and expressive syntax for constructing SQL statements, allowing you to perform various database operations such as querying, inserting, updating, and deleting data.

8. OpenCV

OpenCV (CV2) is a library of programming functions mainly aimed at real-time computer vision. Originally developed by Intel, it was later supported by Willow Garage and is now maintained by Itseez. OpenCV is available for C++, Python, and Java. 

9. urllib 

urllib is a module in the Python standard library that provides a set of simple, high-level functions for working with URLs and web protocols. It includes functions for opening and closing network connections, sending and receiving data, and parsing URLs. 

10. BeautifulSoup 

BeautifulSoup is a Python library for parsing HTML and XML documents. It creates parse trees from the documents that can be used to extract data from HTML and XML files with a simple and intuitive API. BeautifulSoup is commonly used for web scraping and data extraction. 

Wrapping up 

In conclusion, these Python packages are some of the most popular and widely-used libraries in the Python data science ecosystem. They provide powerful and flexible tools for data manipulation, analysis, and visualization, and are essential for aspiring and practicing data scientists. With the help of these Python packages, data scientists can easily perform complex data analysis and machine learning tasks, and create beautiful and informative visualizations. 

If you want to learn more about data science and how to use these Python packages, we recommend checking out Data Science Dojo’s Python for Data Science course, which provides a comprehensive introduction to Python and its data science ecosystem. 

 

Author image - Ayesha
Ayesha Saleem
| April 24

SQL (Structured Query Language) is an important tool for data scientists. It is a programming language used to manipulate data stored in relational databases. Mastering SQL concepts allows a data scientist to quickly analyze large amounts of data and make decisions based on their findings. Here are some essential SQL concepts that every data scientist should know:

First, understanding the syntax of SQL statements is essential in order to retrieve, modify or delete information from databases. For example, statements like SELECT and WHERE can be used to identify specific columns and rows within the database that need attention. A good knowledge of these commands can help a data scientist perform complex operations with ease.

Second, developing an understanding of database relationships such as one-to-one or many-to-many is also important for a data scientist working with SQL.

Here’s an interesting read about Top 10 SQL commands

Let’s dive into some of the key SQL concepts that are important to learn for a data scientist.  

1. Formatting Strings

We are all aware that cleaning up the raw data is necessary to improve productivity overall and produce high-quality decisions. In this case, string formatting is crucial and entails editing the strings to remove superfluous information. For transforming and manipulating strings, SQL provides a large variety of string methods. When combining two or more strings, CONCAT is utilized. The user-defined values that are frequently required in data science can be substituted for the null values using COALESCE. Tiffany Payne  

2. Stored Methods

We can save several SQL statements in our database for later use thanks to stored procedures. When invoked, it allows for reusability and has the ability to accept argument values. It improves performance and makes modifications simpler to implement. For instance, we’re attempting to identify all A-graded students with majors in data science. Keep in mind that CREATE PROCEDURE must be invoked using EXEC in order to be executed, exactly like the function definition. Paul Somerville 

3. Joins

Based on the logical relationship between the tables, SQL joins are used to merge the rows from various tables. In an inner join, only the rows from both tables that satisfy the specified criteria are displayed. In terms of vocabulary, it can be described as an intersection. The list of pupils who have signed up for sports is returned. Sports ID and Student registration ID are identical, please take note. Left Join returns every record from the LEFT table, while Right Join only shows the matching entries from the RIGHT table. Hamza Usmani 

4. Subqueries

Knowing how to utilize subqueries is crucial for data scientists because they frequently work with several tables and can use the results of one query to further limit the data in the primary query. The nested or inner query is another name for it. The subquery is conducted before the main query and needs to be surrounded in parenthesis. It is referred to as a multi-line subquery and requires the use of multi-line operators if it returns more than one row. Tiffany Payne 

5. Left Joins vs Inner Joins

It’s easy to confuse left joins and inner joins, especially for those who are still getting their feet wet with SQL or haven’t touched the language in a while. Make sure that you have a complete understanding of how the various joins produce unique outputs. You will likely be asked to do some kind of join in a significant number of interview questions, and in certain instances, the difference between a correct response and an incorrect one will depend on which option you pick. Tom Miller 

6. Manipulation of dates and times

There will most likely be some kind of SQL query using date-time data, and you should prepare for it. For instance, one of your tasks can be to organize the data into groups according to the months or to change the format of a variable from DD-MM-YYYY to only the month. You should be familiar with the following functions:

– EXTRACT
– DATEDIFF
– DATE ADD, DATE SUB
– DATE TRUNC 

Olivia Tonks 

7. Procedural Data Storage 

Using stored procedures, we can compile a series of SQL commands into a single object in the database and call it whenever we need it. It allows for reusability and when invoked, can take in values for its parameters. It improves efficiency and makes it simple to implement new features. Using this method, we can identify the students with the highest GPAs who have declared a particular major. One goal is to identify all A-students whose major is Data Science. It’s important to remember that, like a function declaration, calling a CREATE PROCEDURE with EXEC is necessary for the procedure to be executed. Nely Mihaylova 

8. Connecting SQL to Python or R 

A developer who is fluent in a statistical language, like Python or R, may quickly and easily use the packages of
language to construct machine learning models on a massive dataset stored in a relational database management system. A programmer’s employment prospects will improve dramatically if they are fluent in both these statistical languages and SQL. Data analysis, dataset preparation, interactive visualizations, and more may all be accomplished in SQL Server with the help of Python or R. Rene Delgado  

9. Features of windows

In order to apply aggregate and ranking functions over a specific window, window functions are used (set of rows). When defining a window with a function, the OVER clause is utilized. The OVER clause serves dual purposes:

– Separates rows into groups (PARTITION BY clause is used).
– Sorts the rows inside those partitions into a specified order (ORDER BY clause is used).
– Aggregate window functions refer to the application of aggregate
functions like SUM(), COUNT(), AVERAGE(), MAX(), and MIN() over a specific window (set of rows). Tom Hamilton Stubber  

10. The emergence of Quantum ML

With the use of quantum computing, more advanced artificial intelligence and machine learning models might be created. Despite the fact that true quantum computing is still a long way off, things are starting to shift as a result of the cloud-based quantum computing tools and simulations provided by Microsoft, Amazon, and IBM. Combining ML and quantum computing has the potential to greatly benefit enterprises by enabling them to take on problems that are currently insurmountable. Steve Pogson 

11. Predicates

Predicates occur from your WHERE, HAVING, and JOIN clauses. They limit the amount of data that has to be processed to run your query. If you say SELECT DISTINCT customer_name FROM customers WHERE signup_date = TODAY() that’s probably a much smaller query than if you run it without the WHERE clause because, without it, we’re selecting every customer that ever signed up!

Data science sometimes involves some big datasets. Without good predicates, your queries will take forever and cost a ton on the infra bill! Different data warehouses are designed differently, and data architects and engineers make different decisions about to lay out the data for the best performance. Knowing the basics of your data warehouse, and how the tables you’re using are laid out, will help you write good predicates that save your company a lot of money during the year, and just as importantly, make your queries run much faster.

For example, a query that runs quickly but simply touches a huge amount of data in Bigquery can be really expensive if you’re using on-demand pricing which scales with the amount of data touched by the query. The same query can be really cheap if you’re using Bigquery’s Flat-rate pricing or Snowflake, both of which are affected by how long your query takes to run, not how much data is fed into it. Kyle Kirwan 

12. Query Syntax

This is what makes SQL so powerful and much easier than coding individual statements for every task we want to complete when extracting data from a database. Every query starts with one or more clauses such as SELECT, FROM, or WHERE – each clause gives us different capabilities; SELECT allows us to define which columns we’d like returned in the results set; FROM indicates which table name(s) we should get our data from; WHERE allows us to specify conditions that rows must meet for them to be included in our result set etcetera! Understanding how all these clauses work together will help you write more effective and efficient queries quickly, allowing you to do better analysis faster! John Smith 

Elevate your business with essential SQL concepts 

AI and machine learning, which have been rapidly emerging, are quickly becoming one of the top trends in technology. Developments in AI and machine learning are being seen all over the world, from big businesses to small startups.

Businesses utilizing these two technologies are able to create smarter systems for their customers and employees, allowing them to make better decisions faster.

These advancements in artificial intelligence and machine learning are helping companies reach new heights with their products or services by providing them with more data to help inform decision-making processes.

Additionally, AI and machine learning can be used to automate mundane tasks that take up valuable time. This could mean more efficient customer service or even automated marketing campaigns that drive sales growth through
real-time analysis of consumer behavior. Rajesh Namase

Author image - Ayesha
Ayesha Saleem
| April 4

Are you interested in learning Python for Data Science? Look no further than Data Science Dojo’s Introduction to Python for Data Science course. This instructor-led live training course is designed for individuals who want to learn how to use Python to perform data analysis, visualization, and manipulation. 

Python is a powerful programming language used in data science, machine learning, and artificial intelligence. It is a versatile language that is easy to learn and has a wide range of applications. In this course, you will learn the basics of Python programming and how to use it for data analysis and visualization. 

Learn the basics of Python programming and how to use it for data analysis and visualization in Data Science Dojo’s Introduction to Python for Data Science course. This instructor-led live training course is designed for individuals who want to learn how to use Python to perform data analysis, visualization, and manipulation. 

Why learn Python for data science? 

Python is a popular language for data science because it is easy to learn and use. It has a large community of developers who contribute to open-source libraries that make data analysis and visualization more accessible. Python is also an interpreted language, which means that you can write and run code without the need for a compiler. 

Python has a wide range of applications in data science, including: 

  • Data analysis: Python is used to analyze data from various sources such as databases, CSV files, and APIs. 
  • Data visualization: Python has several libraries that can be used to create interactive and informative visualizations of data. 
  • Machine learning: Python has several libraries for machine learning, such as scikit-learn and TensorFlow. 
  • Web scraping: Python is used to extract data from websites and APIs.
Python for data science
Python for Data Science – Data Science Dojo

Python for Data Science Course Outline 

Data Science Dojo’s Introduction to Python for Data Science course covers the following topics: 

  • Introduction to Python: Learn the basics of Python programming, including data types, control structures, and functions. 
  • NumPy: Learn how to use the NumPy library for numerical computing in Python. 
  • Pandas: Learn how to use the Pandas library for data manipulation and analysis. 
  • Data visualization: Learn how to use the Matplotlib and Seaborn libraries for data visualization. 
  • Machine learning: Learn the basics of machine learning in Python using sci-kit-learn. 
  • Web scraping: Learn how to extract data from websites using Python. 
  • Project: Apply your knowledge to a real-world Python project. 


Python is an important programming language in the data science field and learning it can have significant benefits for data scientists. Here are some key points and reasons to learn Python for data science, specifically from Data Science Dojo’s instructor-led live training program:
 

  • Python is easy to learn: Compared to other programming languages, Python has a simpler and more intuitive syntax, making it easier to learn and use for beginners. 
  • Python is widely used: Python has become the preferred language for data science and is used extensively in the industry by companies such as Google, Facebook, and Amazon. 
  • Large community: The Python community is large and active, making it easy to get help and support. 
  • A comprehensive set of libraries: Python has a comprehensive set of libraries specifically designed for data science, such as NumPy, Pandas, Matplotlib, and Scikit-learn, making data analysis easier and more efficient. 
  • Versatile: Python is a versatile language that can be used for a wide range of tasks, from data cleaning and analysis to machine learning and deep learning. 
  • Job opportunities: As more and more companies adopt Python for data science, there is a growing demand for professionals with Python skills, leading to more job opportunities in the field. 


Data Science Dojo’s instructor-led live training program provides a structured and hands-on learning experience to master Python for data science. The program covers the fundamentals of
Python programming, data cleaning and analysis, machine learning, and deep learning, equipping learners with the necessary skills to solve real-world data science problems.  

By enrolling in the program, learners can benefit from personalized instruction, hands-on practice, and collaboration with peers, making the learning process more effective and efficient 

Some common questions asked about the course 

  • What are the prerequisites for the course? 

The course is designed for individuals with little to no programming experience. However, some familiarity with programming concepts such as variables, functions, and control structures is helpful. 

  • What is the format of the course? 

The course is an instructor-led live training course. You will attend live online classes with a qualified instructor who will guide you through the course material and answer any questions you may have. 

  • How long is the course? 

The course is four days long, with each day consisting of six hours of instruction. 

Conclusion 

If you’re interested in learning Python for Data Science, Data Science Dojo’s Introduction to Python for Data Science course is an excellent place to start. This course will provide you with a solid foundation in Python programming and teach you how to use Python for data analysis, visualization, and manipulation.  

With its instructor-led live training format, you’ll have the opportunity to learn from an experienced instructor and interact with other students. Enroll today and start your journey to becoming a data scientist with Python.

register now

Shehryar Author - Data Science
Shehryar Mallick
| March 13

This blog explores the difference between mutable and immutable objects in Python. 

Python is a powerful programming language with a wide range of applications in various industries. Understanding how to use mutable and immutable objects is essential for efficient and effective Python programming. In this guide, we will take a deep dive into mastering mutable and immutable objects in Python.

Mutable objects

In Python, an object is considered mutable if its value can be changed after it has been created. This means that any operation that modifies a mutable object will modify the original object itself. To put it simply, mutable objects are those that can be modified either in terms of state or contents after they have been created. The mutable objects that are present in python are lists, dictionaries and sets. 

Mutable-Objects-Code-1
Mutable-Objects-Code-1

 

Mutable-Objects-Code-2
Mutable-Objects-Code-2

 

Mutable-Objects-Code-3
Mutable-Objects-Code-3

 

Advantages of mutable objects 

  • They can be modified in place, which can be more efficient than recreating an immutable object. 
  • They can be used for more complex and dynamic data structures, like lists and dictionaries. 

Disadvantages of mutable objects 

  • They can be modified by another thread, which can lead to race conditions and other concurrency issues. 
  • They can’t be used as keys in a dictionary or elements in a set. 
  • They can be more difficult to reason about and debug because their state can change unexpectedly.

Want to start your EDA journey? Well you can always get yourself registered at Python for Data Science.

While mutable objects are a powerful feature of Python, they can also be tricky to work with, especially when dealing with multiple references to the same object. By following best practices and being mindful of the potential pitfalls of using mutable objects, you can write more efficient and reliable Python code.

Immutable objects 

In Python, an object is considered immutable if its value cannot be changed after it has been created. This means that any operation that modifies an immutable object returns a new object with the modified value. In contrast to mutable objects, immutable objects are those whose state cannot be modified once they are created. Examples of immutable objects in Python include strings, tuples, and numbers.

Immutable Objects Code 1
Immutable Objects Code 1

 

Immutable Objects Code 2
Immutable Objects Code 2

 

Immutable Objects Code 3
Immutable Objects Code 3

 

Advantages of immutable objects 

  • They are safer to use in a multi-threaded environment as they cannot be modified by another thread once created, thus reducing the risk of race conditions. 
  • They can be used as keys in a dictionary because they are hashable and their hash value will not change. 
  • They can be used as elements of a set because they are comparable, and their value will not change. 
  • They are simpler to reason about and debug because their state cannot change unexpectedly. 

Disadvantages of immutable objects

  • They need to be recreated if their value needs to be changed, which can be less efficient than modifying the state of a mutable object. 
  • They take up more memory if they are used in large numbers, as new objects need to be created instead of modifying the state of existing objects. 

How to work with mutable and immutable objects?

To work with mutable and immutable objects in Python, it is important to understand their differences. Immutable objects cannot be modified after they are created, while mutable objects can. Use immutable objects for values that should not be modified, and mutable objects for when you need to modify the object’s state or contents. When working with mutable objects, be aware of side effects that can occur when passing them as function arguments. To avoid side effects, make a copy of the mutable object before modifying it or use immutable objects as function arguments.

Wrapping up

In conclusion, mastering mutable and immutable objects is crucial to becoming an efficient Python programmer. By understanding the differences between mutable and immutable objects and implementing best practices when working with them, you can write better Python code and optimize your memory usage. We hope this guide has provided you with a comprehensive understanding of mutable and immutable objects in Python.

 

Ali Haider - Marketing manager
Ali Haider Shalwani
| March 8

Python has become a popular programming language in the data science community due to its simplicity, flexibility, and wide range of libraries and tools. With its powerful data manipulation and analysis capabilities, Python has emerged as the language of choice for data scientists, machine learning engineers, and analysts.    

By learning Python, you can effectively clean and manipulate data, create visualizations, and build machine-learning models. It also has a strong community with a wealth of online resources and support, making it easier for beginners to learn and get started.   

This blog will navigate your path via a detailed roadmap along with a few useful resources that can help you get started with it.   

Python Roadmap for Data Science Beginners
              Python Roadmap for Data Science Beginners – Data Science Dojo

Step 1. Learn the basics of Python programming  

Before you start with data science, it’s essential to have a solid understanding of its programming concepts. Learn about basic syntax, data types, control structures, functions, and modules.  

Step 2. Familiarize yourself with essential data science libraries   

Once you have a good grasp of Python programming, start with essential data science libraries like NumPy, Pandas, and Matplotlib. These libraries will help you with data manipulation, data analysis, and visualization.   

This blog lists some of the top Python libraries for data science that can help you get started.  

Step 3. Learn statistics and mathematics  

To analyze and interpret data correctly, it’s crucial to have a fundamental understanding of statistics and mathematics.   This short video tutorial can help you to get started with probability.   

Additionally, we have listed some useful statistics and mathematics books that can guide your way, do check them out!  

Step 4. Dive into machine learning  

Start with the basics of machine learning and work your way up to advanced topics. Learn about supervised and unsupervised learning, classification, regression, clustering, and more.   

This detailed machine-learning roadmap can get you started with this step.   

Step 5. Work on projects  

Apply your knowledge by working on real-world data science projects. This will help you gain practical experience and also build your portfolio. Here are some Python project ideas you must try out!  

Step 6. Keep up with the latest trends and developments 

Data science is a rapidly evolving field, and it’s essential to stay up to date with the latest developments. Join data science communities, read blogs, attend conferences and workshops, and continue learning.  

Our weekly and monthly data science newsletters can help you stay updated with the top trends in the industry and useful data science & AI resources, you can subscribe here.   

Additional resources   

  1. Learn how to read and index time series data using Pandas package and how to build, predict or forecast an ARIMA time series model using Python’s statsmodels package with this free course. 
  2. Explore this list of top packages and learn how to use them with this short blog. 
  3. Check out our YouTube channel for Python & data science tutorials and crash courses, it can surely navigate your way.

By following these steps, you’ll have a solid foundation in Python programming and data science concepts, making it easier for you to pursue a career in data science or related fields.   

For an in-depth introduction do check out our Python for Data Science training, it can help you learn the programming language for data analysis, analytics, machine learning, and data engineering. 

Wrapping up

In conclusion, Python has become the go-to programming language in the data science community due to its simplicity, flexibility, and extensive range of libraries and tools.

To become a proficient data scientist, one must start by learning the basics of Python programming, familiarizing themselves with essential data science libraries, understanding statistics and mathematics, diving into machine learning, working on projects, and keeping up with the latest trends and developments.

With the numerous online resources and support available, learning Python and data science concepts has become easier for beginners. By following these steps and utilizing the additional resources, one can have a solid foundation in Python programming and data science concepts, making it easier to pursue a career in data science or related fields.

Data Science Dojo
Stephanie Kirmer
| March 3

Data science model deployment can sound intimidating if you have never had a chance to try it in a safe space. Do you want to make a rest API or a full frontend app? What does it take to do either of these? It’s not as hard as you might think. 

In this series, we’ll go through how you can take machine learning models and deploy them to a web app or a rest API (using saturn cloud) so that others can interact. In this app, we’ll let the user make some feature selections and then the model will predict an outcome for them. But using this same idea, you could easily do other things, such as letting the user retrain the model, upload things like images, or conduct other interactions with your model. 

Just to be interesting, we’re going to do this same project with two frameworks, voila and flask, so you can see how they both work and decide what’s right for your needs. In a flask, we’ll create a rest API and a web app version.
A

Learn data science with Data Science Dojo and Saturn Cloud
               Learn data science with Data Science Dojo and Saturn Cloud – Data Science DojoA

a
Our toolkit
 

Other helpful links 

The project – Deploying machine learning models

The first steps of our process are exactly the same, whether we are going for voila or flask. We need to get some data and build a model! I will take the us department of education’s college scorecard data, and build a quick linear regression model that accepts a few inputs and predicts a student’s likely earnings 2 years after graduation. (you can get this data yourself at https://collegescorecard.ed.gov/data/) 

About measurements 

According to the data codebook: “the cohort of evaluated graduates for earnings metrics consists of those individuals who received federal financial aid, but excludes those who were subsequently enrolled in school during the measurement year, died before the end of the measurement year, received a higher-level credential than the credential level of the field of the study measured, or did not work during the measurement year.” 

Load data 

I already did some data cleaning and uploaded the features I wanted to a public bucket on s3, for easy access. This way, I can load it quickly when the app is run. 

Format for training 

Once we have the dataset, this is going to give us a handful of features and our outcome. We just need to split it between features and target with scikit-learn to be ready to model. (note that all of these functions will be run exactly as written in each of our apps.) 

 Our features are: 

  • Region: geographic location of college 
  • Locale: type of city or town the college is in 
  • Control: type of college (public/private/for-profit) 
  • Cipdesc_new: major field of study (cip code) 
  • Creddesc: credential (bachelor, master, etc) 
  • Adm_rate_all: admission rate 
  • Sat_avg_all: average sat score for admitted students (proxy for college prestige) 
  • Tuition: cost to attend the institution for one year 


Our target outcome is earn_mdn_hi_2yr: median earnings measured two years after completion of degree.
 

Train model 

We are going to use scikit-learn’s pipeline to make our feature engineering as easy and quick as possible. We’re going to return a trained model as well as the r-squared value for the test sample, so we have a quick and straightforward measure of the model’s performance on the test set that we can return along with the model object. 

Now we have a model, and we’re ready to put together the app! All these functions will be run when the app runs, because it’s so fast that it doesn’t make sense to save out a model object to be loaded. If your model doesn’t train this fast, save your model object and return it in your app when you need to predict. 

If you’re interested in learning some valuable tips for machine learning projects, read our blog on machine learning project tips.

Visualization 

In addition to building a model and creating predictions, we want our app to show a visual of the prediction against a relevant distribution. The same plot function can be used for both apps, because we are using plotly for the job. 

The function below accepts the type of degree and the major, to generate the distributions, as well as the prediction that the model has given. That way, the viewer can see how their prediction compares to others. Later, we’ll see how the different app frameworks use the plotly object. 

 

 This is the general visual we’ll be generating — but because it’s plotly, it’ll be interactive! 

Deploying machine learning models
Deploying machine learning models

You might be wondering whether your favorite visualization library could work here — the answer is, maybe! Every python viz library has idiosyncrasies and is not likely to be supported exactly the same for voila and flask. I chose plotly because it has interactivity and is fully functional in both frameworks, but you are welcome to try your own visualization tool and see how it goes.  

Wrapping up

In conclusion, deploying machine learning models to a web app or REST API can seem daunting, but it’s not as difficult as it may seem. By using frameworks like voila and Flask, along with libraries like scikit-learn, plotly, and pandas, you can easily create an app that allows users to interact with machine learning models. In this project, we used the US Department of Education’s college scorecard data to build a linear regression model that predicts a student’s likely earnings two years after graduation.

 

Manthan Koolwal
Manthan Koolwal
| February 27

These days social platforms are quite popular. Websites like YouTube, Facebook, Instagram, etc. are used widely by billions of people.  These websites have a lot of data that can be used for sentiment analysis against any incident, election prediction, result prediction of any big event, etc. If you have this data, you can analyze the risk of any decision.

In this post, we are going to web-scrape public Facebook pages using Python and Selenium. We will also discuss the libraries and tools required for the process. So, if you’re interested in web scraping and data analysis, keep reading!

Facebook scraping with Python

Read more about web scraping with Python and BeautifulSoup and kickstart your analysis today.   

What do we need before writing the code? 

We will use Python 3.x for this tutorial, and I am assuming that you have already installed it on your machine. Other than that, we need to install two III-party libraries BeautifulSoup and Selenium. 

  • BeautifulSoup — This will help us parse raw HTML and extract the data we need. It is also known as BS4. 
  • Selenium — It will help us render JavaScript websites. 
  • We also need chromium to render websites using Selenium API. You can download it from here. 

 

Before installing these libraries, you have to create a folder where you will keep the python script. 

Now, create a python file inside this folder. You can use any name and then finally, install these libraries. 

What will we extract from a Facebook page? 

We are going to scrape addresses, phone numbers, and emails from our target page. 

First, we are going to extract the raw HTML using Selenium from the Facebook page and then we are going to use. find() and .find_all() methods of BS4 to parse this data out of the raw HTML. Chromium will be used in coordination with Selenium to load the website. 

Read about: How to scrape Twitter data without Twitter API using SNScrape. 

Let’s start scraping  

Let’s first write a small code to see if everything works fine for us. 

Let’s understand the above code step by step. 

  • We have imported all the libraries that we installed earlier. We have also imported the time library. It will be used for the driver to wait a little more before closing the chromium driver. 
  • Then we declared the PATH of our chromium driver. This is the path where you have kept the chromedriver. 
  • One empty list and an object to store data. 
  • target_url holds the page we are going to scrape. 
  • Then using .Chrome() method we are going to create an instance for website rendering. 
  • Then using .get() method of Selenium API we are going to open the target page. 
  • .sleep() method will pause the script for two seconds. 
  • Then using .page_source we collect all the raw HTML of the page. 
  • .close() method will close down the chrome instance. 

 

Once you run this code it will open a chrome instance, then it will open the target page and then after waiting for two seconds the chrome instance will be closed. For the first time, the chrome instance will open a little slow but after two or three times it will work faster. 

Once you inspect the page you will find that the intro section, contact detail section, and photo gallery section all have the same class names

with a div. But since for this tutorial, our main focus is on contact details therefore we will focus on the second div tag. 

Let’s find this element using the .find() method provided by the BS4 API. 

We have created a parse tree using BeautifulSoup and now we are going to extract crucial data from it. 

Using .find_all() method we are searching for all the div tags with class


and then we selected the second element from the list.
 

Now, here is a catch. Every element in this list has the same class and tag. So, we have to use regular expressions in order to find the information we need to extract. 

Let’s find all of these element tags and then later we will use a for loop to iterate over each of these elements to identify which element is what. 

Here is how we will identify the address, number, and email. 

  • The address can be identified if the text contains more than two commas. 
  • The number can be identified if the text contains more than two dash(-). 
  • Email can be identified if the text contains “@” in it. 

We ran a for loop on allDetails variable. Then we are one by one identifying which element is what. Then finally if they satisfy the if condition we are storing it in the object o. 

In the end, you can append the object o in the list l and print it. 

Once you run this code you will find this result. 

Complete Code 

We can make further changes to this code to scrape more information from the page. But for now, the code will look like this. 

Conclusion 

Today we scraped the Facebook page to collect emails for lead generation. Now, this is just an example of scraping a single page. If you have thousands of pages, then we can use the Pandas library to store all the data in a CSV file. I leave this task for you as homework. 

I hope you like this little tutorial and if you do then please do not forget to share it with your friends and on your social media. 

Data Science Dojo
Syed Umair Hasan
| February 22

In this step-by-step guide, learn how to deploy a web app for Gradio on Azure with Docker. This blog covers everything from Azure Container Registry to Azure Web Apps, with a step-by-step tutorial for beginners.

I was searching for ways to deploy a Gradio application on Azure, but there wasn’t much information to be found online. After some digging, I realized that I could use Docker to deploy custom Python web applications, which was perfect since I had neither the time nor the expertise to go through the “code” option on Azure. 

The process of deploying a web app begins by creating a Docker image, which contains all of the application’s code and its dependencies. This allows the application to be packaged and pushed to the Azure Container Registry, where it can be stored until needed. From there, it can be deployed to the Azure App Service, where it is run as a container and can be managed from the Azure Portal. In this portal, users can adjust the settings of their app, as well as grant access to roles and services when needed. 

Once everything is set and the necessary permissions have been granted, the web app should be able to properly run on Azure. Deploying a web app on Azure using Docker is an easy and efficient way to create and deploy applications, and can be a great solution for those who lack the necessary coding skills to create a web app from scratch!’

Comprehensive overview of creating a web app for Gradio

Gradio application 

Gradio is a Python library that allows users to create interactive demos and share them with others. It provides a high-level abstraction through the Interface class, while the Blocks API is used for designing web applications.

Blocks provide features like multiple data flows and demos, control over where components appear on the page, handling complex data flows, and the ability to update properties and visibility of components based on user interaction. With Gradio, users can create a web application that allows their users to interact with their machine learning model, API, or data science workflow. 

The two primary files in a Gradio Application are:

  1. App.py: This file contains the source code for the application.
  2. Requirements.txt: This file lists the Python libraries required for the source code to function properly.

Docker 

Docker is an open-source platform for automating the deployment, scaling, and management of applications, as containers. It uses a container-based approach to package software, which enables applications to be isolated from each other, making it easier to deploy, run, and manage them in a variety of environments. 

A Docker container is a lightweight, standalone, and executable software package that includes everything needed to run a specific application, including the code, runtime, system tools, libraries, and settings. Containers are isolated from each other and the host operating system, making them ideal for deploying microservices and applications that have multiple components or dependencies. 

Docker also provides a centralized way to manage containers and share images, making it easier to collaborate on application development, testing, and deployment. With its growing ecosystem and user-friendly tools, Docker has become a popular choice for developers, system administrators, and organizations of all sizes. 

Azure Container Registry 

Azure Container Registry (ACR) is a fully managed, private Docker registry service provided by Microsoft as part of its Azure cloud platform. It allows you to store, manage, and deploy Docker containers in a secure and scalable way, making it an important tool for modern application development and deployment. 

With ACR, you can store your own custom images and use them in your applications, as well as manage and control access to them with role-based access control. Additionally, ACR integrates with other Azure services, such as Azure Kubernetes Service (AKS) and Azure DevOps, making it easy to deploy containers to production environments and manage the entire application lifecycle. 

ACR also provides features such as image signing and scanning, which helps ensure the security and compliance of your containers. You can also store multiple versions of images, allowing you to roll back to a previous version if necessary. 

Azure Web App 

Azure Web Apps is a fully managed platform for building, deploying, and scaling web applications and services. It is part of the Azure App Service, which is a collection of integrated services for building, deploying, and scaling modern web and mobile applications. 

With Azure Web Apps, you can host web applications written in a variety of programming languages, such as .NET, Java, PHP, Node.js, and Python. The platform automatically manages the infrastructure, including server resources, security, and availability, so that you can focus on writing code and delivering value to your customers. 

Azure Web Apps supports a variety of deployment options, including direct Git deployment, continuous integration and deployment with Visual Studio Team Services or GitHub, and deployment from Docker containers. It also provides built-in features such as custom domains, SSL certificates, and automatic scaling, making it easy to deliver high-performing, secure, and scalable web applications. 

A step-by-step guide to deploying a Gradio application on Azure using Docker

This guide assumes a foundational understanding of Azure and the presence of Docker on your desktop. Refer to the Mac,  Windows , or Linux getting started instructions for Docker. 

Step 1: Create an Azure Container Registry resource 

Go to Azure Marketplace, search for ‘container registry’ and hit ‘Create’. 

STEP 1: Create an Azure Container Registry resource
Create an Azure Container Registry resource

Under the “Basics” tab, complete the required information and leave the other settings as the default. Then, click “Review + Create.” 

Web App for Gradio Step 1A
Web App for Gradio Step 1A

 

Step 2: Create a Web App resource in Azure 

In Azure Marketplace, search for “Web App”, select the appropriate resource as depicted in the image, and then click “Create”. 

STEP 2: Create a Web App resource in Azure
Create a Web App resource in Azure

 

Under the “Basics” tab, complete the required information, choose the appropriate pricing plan, and leave the other settings as the default. Then, click “Review + Create.”  

Web App for Gradio Step 2B
Web App for Gradio Step 2B

 

Web App for Gradio Step 2C
Web App for Gradio Step 2c

 

Upon completion of all deployments, the following three resources will be in your resource group. 

Web App for Gradio Step 2D
Web App for Gradio Step 2D

Step 3: Create a folder containing the “App.py” file and its corresponding “requirements.txt” file 

To begin, we will utilize an emotion detector application, the model for which can be found at https://huggingface.co/bhadresh-savani/distilbert-base-uncased-emotion. 

APP.PY 

REQUIREMENTS.TXT 

Step 4: Launch Visual Studio Code and open the folder

Step 4: Launch Visual Studio Code and open the folder. 
Step 4: Launch Visual Studio Code and open the folder.

Step 5: Launch Docker Desktop to start Docker. 

STEP 5: Launch Docker Desktop to start Docker
STEP 5: Launch Docker Desktop to start Docker.

Step 6: Create a Dockerfile 

A Dockerfile is a script that contains instructions to build a Docker image. This file automates the process of setting up an environment, installing dependencies, copying files, and defining how to run the application. With a Dockerfile, developers can easily package their application and its dependencies into a Docker image, which can then be run as a container on any host with Docker installed. This makes it easy to distribute and run the application consistently in different environments. The following contents should be utilized in the Dockerfile: 

DOCKERFILE 

STEP 6: Create a Dockerfile
STEP 6: Create a Dockerfile

Step 7: Build and run a local Docker image 

Run the following commands in the VS Code terminal. 

1. docker build -t demo-gradio-app 

  • The “docker build” command builds a Docker image from a Docker file. 
  • The “-t demo-gradio-app” option specifies the name and optionally a tag to the name of the image in the “name:tag” format. 
  • The final “.” specifies the build context, which is the current directory where the Dockerfile is located.

 

2. docker run -it -d –name my-app -p 7000:7000 demo-gradio-app 

  • The “docker run” command starts a new container based on a specified image. 
  • The “-it” option opens an interactive terminal in the container and keeps the standard input attached to the terminal. 
  • The “-d” option runs the container in the background as a daemon process. 
  • The “–name my-app” option assigns a name to the container for easier management. 
  • The “-p 7000:7000” option maps a port on the host to a port inside the container, in this case, mapping the host’s port 7000 to the container’s port 7000. 
  • The “demo-gradio-app” is the name of the image to be used for the container. 

This command will start a new container with the name “my-app” from the “demo-gradio-app” image in the background, with an interactive terminal attached, and port 7000 on the host mapped to port 7000 in the container. 

Web App for Gradio Step 7A
Web App for Gradio Step 7A

 

Web App for Gradio Step 7B
Web App for Gradio Step 7B

 

To view your local app, navigate to the Containers tab in Docker Desktop, and click on link under Port. 

Web App for Gradio Step 7C
Web App for Gradio Step 7C

Step 8: Tag & Push the Image to Azure Container Registry 

First, enable ‘Admin user’ from the ‘Access Keys’ tab in Azure Container Registry. 

STEP 8: Tag & Push Image to Azure Container Registry
Tag & Push Images to Azure Container Registry

 

Login to your container registry using the following command, login server, username, and password can be accessed from the above step. 

docker login gradioappdemos.azurecr.io

Web App for Gradio Step 8B
Web App for Gradio Step 8B

 

Tag the image for uploading to your registry using the following command. 

 

docker tag demo-gradio-app gradioappdemos.azurecr.io/demo-gradio-app 

  • The command “docker tag demo-gradio-app gradioappdemos.azurecr.io/demo-gradio-app” is used to tag a Docker image. 
  • “docker tag” is the command used to create a new tag for a Docker image. 
  • “demo-gradio-app” is the source image name that you want to tag. 
  • “gradioappdemos.azurecr.io/demo-gradio-app” is the new image name with a repository name and optionally a tag in the “repository:tag” format. 
  • This command will create a new tag “gradioappdemos.azurecr.io/demo-gradio-app” for the “demo-gradio-app” image. This new tag can be used to reference the image in future Docker commands. 

Push the image to your registry. 

docker push gradioappdemos.azurecr.io/demo-gradio-app 

  • “docker push” is the command used to upload a Docker image to a registry. 
  • “gradioappdemos.azurecr.io/demo-gradio-app” is the name of the image with the repository name and tag to be pushed. 
  • This command will push the Docker image “gradioappdemos.azurecr.io/demo-gradio-app” to the registry specified by the repository name. The registry is typically a place where Docker images are stored and distributed to others. 
Web App for Gradio Step 8C
Web App for Gradio Step 8C

 

In the Repository tab, you can observe the image that has been pushed. 

Web App for Gradio Step 8D
Web App for Gradio Step 8B

Step 9: Configure the Web App 

Under the ‘Deployment Center’ tab, fill in the registry settings then hit ‘Save’. 

STEP 9: Configure the Web App
Configure the Web App

 

In the Configuration tab, create a new application setting for the website port 7000, as specified in the app.py file and the hit ‘Save’. 

Web App for Gradio Step 9B
Web App for Gradio Step 9B
Web App for Gradio Step 9C
Web App for Gradio Step 9C

 

Web App for Gradio Step 9D
Web App for Gradio Step 9D

 

In the Configuration tab, create a new application setting for the website port 7000, as specified in the app.py file and the hit ‘Save’. 

Web App for Gradio Step 9E
Web App for Gradio Step 9E

 

After the image extraction is complete, you can view the web app URL from the Overview page. 

 

Web App for Gradio Step 9F
Web App for Gradio Step 9F

 

Web App for Gradio Step 9G
Web App for Gradio Step 9G

Step 1O: Pushing Image to Docker Hub (Optional) 

Here are the steps to push a local Docker image to Docker Hub: 

  • Login to your Docker Hub account using the following command: 

docker login

  • Tag the local image using the following command, replacing [username] with your Docker Hub username and [image_name] with the desired image name: 

docker tag [image_name] [username]/[image_name]

  • Push the image to Docker Hub using the following command: 

docker push [username]/[image_name] 

  • Verify that the image is now available in your Docker Hub repository by visiting https://hub.docker.com/ and checking your repositories. 
Web App for Gradio Step 10A
Web App for Gradio Step 10A

 

Web App for Gradio Step 10B
Web App for Gradio Step 10B

Wrapping it up

In conclusion, deploying a web application using Docker on Azure is an easy and efficient way to create and deploy applications. This method is suitable for those who lack the necessary coding skills to create a web app from scratch. Docker is an open-source platform for automating the deployment, scaling, and management of applications, as containers.

Azure Container Registry is a fully managed, private Docker registry service provided by Microsoft as part of its Azure cloud platform. Azure Web Apps is a fully managed platform for building, deploying, and scaling web applications and services. By following the step-by-step guide provided in this article, users can deploy a Gradio application on Azure using Docker.

Nathan 500x500 web
Nathan Piccini
| February 3

In this blog post, we’ll explore five ideas for data science projects that can help you build expertise in computer vision, natural language processing (NLP), sales forecasting, cancer detection, and predictive maintenance using Python. 

As a data science student, it is important to continually build and improve your skills by working on projects that are both challenging and relevant to the field. 

 

Computer vision with Python and OpenCV 

Computer vision is a field of artificial intelligence that focuses on the development of algorithms and models that can interpret and understand visual information. One project idea in this area could be to build a facial recognition system using Python and OpenCV.

The project would involve training a model to detect and recognize faces in images and video and comparing the performance of different algorithms. To get started, you’ll want to become familiar with the OpenCV library, which is a powerful tool for image and video processing in Python. 

 

NLP with Python and NLTK/spaCy 

NLP is a field of AI that deals with the interaction between computers and human language. A great project idea in this area would be to develop a text classification system to automatically categorize news articles into different topics.

This project could use Python libraries such as NLTK or spaCy to preprocess the text data, and then train a machine-learning model to make predictions. The NLTK library has many useful functions for text preprocessing, such as tokenization, stemming and lemmatization, and the spaCy library is a modern library for performing complex NLP tasks. 

 

Learn more about Python project ideas for 2023

 

Sales forecasting with Python and Pandas 

Sales forecasting is an important part of business operations, and as a data science student, you should have a good understanding of how to build models that can predict future sales. A project idea in this area could be to create a sales forecasting model using Python and Pandas.

The project would involve using historical sales data to train a model that can predict future sales numbers for a particular product or market. To get started, you’ll want to become familiar with the Pandas library, which is a powerful tool for data manipulation and analysis in Python. 

 

Sales forecast using Python - data science projects
Sales forecast using Python

Cancer detection with Python and scikit-learn 

Cancer detection is a critical area of healthcare, and machine learning can play an important role in this field. A project idea in this area could be to build a machine-learning model to predict the likelihood of a patient having a certain type of cancer.

The project would use a dataset of patient medical records and explore the use of different features and algorithms for making predictions. The scikit-learn library is a powerful tool for building machine-learning models in Python and it provides an easy-to-use interface to train, test, and evaluate your model. 

 

Learn about Python for Data Science and speed up with Python fundamentals 

 

Predictive maintenance with Python and Scikit-learn 

Predictive maintenance is a field of industrial operations that focuses on using data and machine learning to predict when equipment is likely to fail so that maintenance can be scheduled in advance. A project idea in this area could be to develop a system that can analyze sensor data from the equipment, and use machine learning to identify patterns that indicate an imminent failure.

To get started, you’ll want to become familiar with the scikit-learn library and the concepts of clustering, classification, and regression, as well as the Python libraries for working with sensor data and machine learning. 

 

Data science projects in a nutshell:

These are just a few project ideas to help you build your skills as a data science student. Each of these projects offers the opportunity to work with real-world data, use powerful Python libraries and tools, and develop models that can make predictions and solve complex problems. As you work on these projects, you’ll gain valuable experience that will help you advance your career in. 

Ruhma - Author
Ruhma Khawaja
| February 2

Are you looking for some great Python Project Ideas? Here is a list of the top 5 Python project ideas for students and aspiring people to practice.
 

Want to start a career in programming? Here are the top 5 Python project ideas 

If you keep tabs on the latest technologies, you are aware of how powerful and versatile Python is. It is widely used in numerous fields, from data science and machine learning to web development and game development. It is a widely used programming language in computer science. Its features have made it a popular choice among developers in 2022 and its trend is expected to continue in the future.  

The demand for using Python in IT projects is on the rise, due to its user-friendly nature and versatility in creating various technology applications. A growing number of individuals in the tech industry are looking for ways to improve their skills by taking on projects, volunteering, and internships using Python. As a student, learning Python can open many opportunities for you and help you build a wide range of projects that can highlight your skills and capabilities.  

Are you looking for some great Python Project Ideas? Here is a list of the top 5 Python project ideas for engineering students and aspiring coders to practice. 

Python project ideas
Python project ideas – Data Science Dojo

1. Game Development 

Game development is a fun and challenging way to learn about programming and Python is a great language for building games. Using the Pygame library, you can easily create 2D games with features such as animation, sound, and user input. It is built on top of the SDL library, which provides low-level access to audio, keyboard, mouse, and display functions.

To create a simple game using Pygame, you will need to understand the basics of game development such as game loop, event handling, and game mechanics. You can use Pygame’s built-in functions to create a game window and display 2D graphics. This project will help you learn how to use Python for game development and gain experience with 2D graphics, animation, sound, and game mechanics. It will also give you a chance to explore the possibilities of Pygame library and create your own game. 

 

2. Weather App 

Creating a weather app is a great project idea for those interested in building applications that interact with external APIs. API, short for Application Programming Interface, are a set of rules and protocols that allow software systems to communicate. In this case, we will be using a weather API that provides current weather information for a given location. To build this weather app, you will first need to find a weather API that you can use.

To build a weather app with the request’s library in Python, first you choose a weather API and sign up for an API key. Next, you install the requests library in Python and fetch weather data with requests.get() and parse with json.loads(). Then, use pandas and matplotlib to analyze and visualize data and then create a user interface with a library like tkinter or PyQt. Lastly, try-except blocks for error handling and deploy your project on a web server or cloud platform if desired. 

 

Enroll in ‘Python for Data Science’ To learn Python and its effective use in data analysis, analytics, machine learning, and data science. 

 

3. Data Analysis 

Data analysis is an essential skill for many fields, and Python is an excellent language for working with data. The pandas and matplotlib libraries are commonly used in data analysis and visualization. Pandas is a powerful library for working with data in Python. Matplotlib is a plotting library for creating static, animated, and interactive visualizations in Python. It is used to create a wide variety of plots, including line plots, scatter plots, histograms, and heat maps. It also allows you to customize the appearance of the plots to match your needs. 

To start this project, select a dataset so that you can use pandas to read the data into a Data Frame and perform various operations on it. Then, you must clean and filter the data. Next, you can use matplotlib to create various visualizations of the data. This project will help you learn how to work with data in Python, gain experience with data analysis and visualization, and learn to use the pandas and matplotlib libraries.  

 

4. Chatbot 

Another hot topic is creating a chatbot. A chatbot is a computer program that simulates human conversation, and it can be used in a wide range of applications, such as customer service, e-commerce, and personal assistants. To build a chatbot using Python, you will need to use a combination of NLP and ML techniques.

For NLP, you can use Python libraries such as NLTK and Spacy, which provide tools for tokenizing, stemming, and lemmatizing text, as well as for performing part-of-speech tagging and named entity recognition. This project can have good learning outcomes like learning usage of natural language processing and machine learning techniques in Python. 

 

Learn about Top Python Packages

 

5. Web Scraper 

Web scraping is the process of extracting data from websites and a web scraper is a tool that automates this process. Creating a web scraper using Python’s Beautiful Soup library is a great project idea for those interested in web development and data mining. To build a web scraper, you will first need to install the Beautiful Soup library and the requests library. Another way is Selenium, a tool used for automating web browsers to do several tasks. 

The requests library is used to send an HTTP request to a website and retrieve the HTML source code, while Beautiful Soup is used to parse the HTML and extract the data. Beautiful Soup’s methods and selectors are used to extract the data required. 

 

Bottom Line 

In conclusion, there are countless possibilities for Python projects, these are just a small selection of ideas to spark inspiration. The key to success is to find a project that aligns with your interests and start experimenting with the vast array of libraries and frameworks that Python has to offer. With a bit of creativity and persistence, you can create something truly remarkable and elevate your skills to new heights. 

 

Waasif Nadeem
Waasif Nadeem
| August 4

This blog will discuss the strengths and limitations of Python and Julia to address a very common topic of debate; is Julia better than Python?

It is a high-level programming language that was designed in 2012, specifically for the Data Science and Machine Learning community. It was introduced as a mathematically oriented language and became popular for its speed and performance over other languages like Python and R.

Almost every introductory level course on Julia talks about its speed compared to Python, NumPy, and C, claiming that the performance of this language is as good as the speed of C. Also, it outperforms Python and NumPy but only by a margin. This leads to another debate; Will Julia conquer Python’s kingdom in Data Science?

To be able to address this question, let us dive deeper to compare several aspects of the two languages.   

python_julia
Python_Julia

Popularity and community

Python has been operational for over 30 years and is one of the most popular programming languages right now with a large developer community offering solutions and help for potential problems. This makes Python much easier and more convenient to use than any other language.

Julia has a small but rapidly growing and active community. Even though the number of followers is constantly increasing for it; the majority of support is still provided by the writers themselves. It is expected that when the scope of this programming language expands outside of data science, the popularity of the language will increase.

Speed

It takes leverage upon other languages when it comes to its execution speed. It is a compiled language primarily written on its own base. Well-written code in it can be as fast as in C. It is an excellent solution for challenges related to data analysis and statistical computing.

Python is an interpreted language that is not famous for its speed. Self-implemented functions in Python can take a lot longer to compile as compared to Julia or C. Therefore, it uses libraries like NumPy, Sklearn, and TensorFlow to implement different functions and algorithms. These libraries provide implementations of algorithms that are much faster than Python but slower than Julia.

Libraries

Python offers an extensive range of libraries that can be simply imported, and their functions can be used. Python is also supported by a large number of third-party libraries.

Julia does not have much in its library collection and the packages are not very well maintained. This makes some implementations like neural networks a bit tedious. Due to the lack of libraries, the scope of it is also limited, as many tasks like web development cannot be performed with this language yet. However, considering the expectations of the growing community, we can expect more developed and well-maintained libraries from it soon.

Code conversion

One of the most fascinating features of Julia is converting code from other programming languages to it. It is a very straightforward process that is widely supported.

In Python, code conversion is much more difficult than in Julia, but it is still possible. Julia’s code can be shared with Python using the module named “PyCall.”

Linear algebra (Data Science algorithms)

Julia was made with the intention of being used in statistics and machine learning. It offers various methods and algorithms for linear algebra. These methods are quite easy to implement, and their syntax is very similar to mathematical expressions.

Python does not have its own pre-defined methods for linear algebra, so users work through libraries, such as NumPy for such implementations. These implementations are, however, not as simple to use as in Julia.

Will Julia replace Python?

It would be too early to say that Julia will replace Python in Data Science. Both have their respective advantages. It depends on your use case and preference.

Python has built the trust of its community for years, and it is not an easy task for Julia to announce itself in that community. But it is not impossible either. As the community of this language grows, more support would be available for people. With the growth in resources, maybe in the near future, This language would be a new norm in Data Science.

Upgrade your data science skillset with our Python for Data Science and Data Science Bootcamp training!

Ebad - Data engineer
Ebad Ullah Khan
| October 19

In this blog, we will be learning how to program some basic movements in a drone with the help of Python. The drone we will use is Dji Tello. We will learn drone programming with Scratch, Swift, and even Python.  

 A step-by-step guide to learning drone programming

We will go step by step through how to issue commands through the Wi-Fi network 

drone programming
Drone – Data Science Dojo

 

Installing Python libraries 

First, we will need some Python libraries installed onto our laptop. Let’s install them with the following two commands: 

 

pip install djitellopy 

pip install opencv-python 

 

The djitellopy is a python library making use of the official Tello sdk. The second command is to install opencv which will help us to look through the camera of the drone. Some other libraries this program will make use of are ‘keyboard’ and ‘time’. After installation, we import them into our project   

 

import keyboard as kp 

from djitellopy import tello 

import time 

import cv2 

 

 Read more about Machine Learning using Python in cloud

Connection

We must first instantiate the Tello class so we can use it afterward. For the following commands to work, we must switch the drone to On and find and connect to the Wi-Fi network generated by it on our laptop. The tel.connect() command lets us connect the drone to our program. After the connection of the drone to our laptop is successful, the following commands can be executed. 

 

tel = tello.Tello() 
tel.connect() 

 

 

Sending ending commands to the drone 

We will build a function which will send movement commands to the drone.  

def getKeyboardInput(img): 

    kp.init() 

    lr, fb, ud, yv = 0, 0, 0, 0 

    speed = 50 

    if kp.getKey("LEFT"): 

        lr = -speed 

    elif kp.getKey("RIGHT"): 

        lr = speed 

 

    if kp.getKey("UP"): 

        fb = speed 

    elif kp.getKey("DOWN"): 

        fb = -speed 

 

    if kp.getKey("w"): 

        ud = speed 

    elif kp.getKey("s"): 

        ud = -speed 

     

    if kp.getKey("a"): 

        yv = speed 

    elif kp.getKey("d"): 

        yv = -speed 

 

    if kp.getKey("l"): 

        tel.land() 

    if kp.getKey("t"): 

        tel.takeoff() 

 

    if kp.getKey("z"): 

        cv2.imwrite("Resources/images/{time.time}.jpg", img) 

        time.sleep(0.05) 

    return [lr, fb, ud, yv] 

tel.streamon() 

 

 

The drone takes 4 inputs to move so we first take four values and assign a 0 to them. The speed must be set to an initial value for the drone to take off. Now we map the keyboard keys to our desired values and assign those values to the four variables. For example, if the keyboard key is “LEFT” then assign the speed with a value of -50. If the “RIGHT” key is pressed, then assign a value of 50 to the speed variable, and so on. The code block below explains how to map the keyboard keys to the variables: 

if kp.getKey("LEFT"): 

        lr = -speed 

    elif kp.getKey("RIGHT"): 

        lr = speed 

 

 

This program also takes two extra keys for landing and taking off (l and t). A keyboard key “z” is also assigned if we want to take a picture from the drone. As the drone’s video will be on, whenever we click on “z” key, opencv will save the image in a folder specified by us. After providing all the combinations, we must return the values in a 1D array. Also, don’t forget to run tel.streamon() to turn on the video streaming.     

We must make the drone take commands until and unless we press the “l” key for landing. So, we have a while True loop in the following code segment: 

 

Calling the function

 

while True: 

    img = tel.get_frame_read().frame 

    img = cv2.resize(img,(360,360)) 

    cv2.imshow('Picture',img) 

    cv2.waitKey(1) 
 
    vals = getKeyboardInput(img) 

    tel.send_rc_control(vals[0],vals[1],vals[2],vals[3]) 

    time.sleep(0.05) 

 

 

 

The get_frame_read() function reads the video frame by frame (just like an image) so we can resize it and show it on the laptop screen. The process will be so fast that it will completely look like a video being displayed.  

The last thing we must do is to call the function we created above. Remember, we have a list being returned from it. Each value of the list must be sent as a separate index value to the send_rc_control method of the tel object 

 

Execution 

 

Before running the code, confirm that the laptop is connected to the drone via Wi-Fi. 

Now, execute the python file and then press “t” for the drone to take off. From there, you can press the keyboard keys for it to move in your desired direction. When you want the drone to take pictures, press “z” and when you want it to land, press “l” 

 

Conclusion

 

In this blog, we learned how to issue basic keyboard commands for the drone to move. Furthermore, we can also add more keys for inbuilt Tello functions like “flip” and “move away”. Videos can be captured from the drone and stored locally on our laptop 

Data Science Dojo
Umair Hasan
| September 26

In this tutorial, you will learn how to create an attractive voice-controlled python chatbot application with a small amount of coding. To build our application we’ll first create a good-looking user interface through the built-in Tkinter library in Python and then we will create some small functions to achieve our task. 

 

Here is a sneak peek of what we are going to create. 

 

Voice controlled chatbot
Voice controlled chatbot using coding in Python – Data Science Dojo

Before kicking off, I hope you already have a brief idea about web scraping, if not then read the following article talking about Python web scraping 

 

PRO-TIP: Join our 5-day instructor-led Python for Data Science training to enhance your deep learning

 

Pre-requirements for building a voice python chatbot

Make sure that you are using Python 3.8+ and the following libraries are installed on it 

  • Pyttsx3 (pyttsx3 is a text-to-speech conversion library in Python) 
  • SpeechRecognition (Library for performing speech recognition) 
  • Requests (The requests module allows you to send HTTP requests using Python) 
  • Bs4 (Beautiful Soup is a library that is used to scrape information from web pages) 
  • pyAudio (With PyAudio, you can easily use Python to play and record audio) 

 

If you are still facing installation errors or incompatibility errors, then you can try downloading specific versions of the above libraries as they are tested and working currently in the application. 

 

  • Python 3.10 
  • pyttsx3==2.90 
  • SpeechRecognition==3.8.1 
  • requests==2.28.1
  • beautifulsoup4==4.11.1 
  • beautifulsoup4==4.11.1 

 

Now that we have set everything it is time to get started. Open a fresh new py file and name it VoiceChatbot.py. Import the following relevant libraries on the top of the file. 

 

  • from tkinter import * 
  • import time
  • import datetime
  • import pyttsx3
  • import speech_recognition as sr
  • from threading import Thread
  • import requests
  • from bs4 import BeautifulSoup 

 

The code is divided into the GUI section, which uses the Tkinter library of python and 7 different functions. We will start by declaring some global variables and initializing instances for text-to-speech and Tkinter. Then we start creating the windows and frames of the user interface. 

 

The user interface 

This part of the code loads images initializes global variables, and instances and then it creates a root window that displays different frames. The program starts when the user clicks the first window bearing the background image. 

 

if __name__ == “__main__”: 

 

#Global Variables 

loading = None
query = None
flag = True
flag2 = True

   

#initalizng text to speech and setting properties 

engine = pyttsx3.init() # Windows voices = engine.getProperty('voices') engine.setProperty('voice', voices[1].id) rate = engine.getProperty('rate') engine.setProperty('rate', rate-10) 

 

#loading images 

    img1= PhotoImage(file='chatbot-image.png') 
    img2= PhotoImage(file='button-green.png') 
    img3= PhotoImage(file='icon.png') 
    img4= PhotoImage(file='terminal.png') 
    background_image=PhotoImage(file="last.png") 
    front_image = PhotoImage(file="front2.png") 

 

#creating root window 

    root=Tk() 
    root.title("Intelligent Chatbot") 
    root.geometry('1360x690+-5+0')
    root.configure(background='white') 

 

#Placing frame on root window and placing widgets on the frame 

    f = Frame(root,width = 1360, height = 690) 
    f.place(x=0,y=0) 
    f.tkraise() 

 

#first window which acts as a button containing the background image 

    okVar = IntVar() 
    btnOK = Button(f, image=front_image,command=lambda: okVar.set(1)) 
    btnOK.place(x=0,y=0) 
    f.wait_variable(okVar) 
    f.destroy()     
    background_label = Label(root, image=background_image) 
    background_label.place(x=0, y=0) 

 

#Frame that displays gif image 

    frames = [PhotoImage(file='chatgif.gif',format = 'gif -index %i' %(i)) for i in range(20)] 
    canvas = Canvas(root, width = 800, height = 596) 
    canvas.place(x=10,y=10) 
    canvas.create_image(0, 0, image=img1, anchor=NW) 

 

#Question button which calls ‘takecommand’ function 

    question_button = Button(root,image=img2, bd=0, command=takecommand) 
    question_button.place(x=200,y=625) 

 

#Right Terminal with vertical scroll 

    frame=Frame(root,width=500,height=596) 
    frame.place(x=825,y=10) 
    canvas2=Canvas(frame,bg='#FFFFFF',width=500,height=596,scrollregion=(0,0,500,900)) 
    vbar=Scrollbar(frame,orient=VERTICAL) 
    vbar.pack(side=RIGHT,fill=Y) 
    vbar.config(command=canvas2.yview) 
    canvas2.config(width=500,height=596, background="black") 
    canvas2.config(yscrollcommand=vbar.set) 
    canvas2.pack(side=LEFT,expand=True,fill=BOTH) 
    canvas2.create_image(0,0, image=img4, anchor="nw") 
    task = Thread(target=main_window) 
    task.start() 
    root.mainloop() 

 

The main window functions 

This is the first function that is called inside a thread. It first calls the wishme function to wish the user. Then it checks whether the query variable is empty or not. If the query variable is empty, then it checks the contents of the query variable. If there is a shutdown or quit or stop word in query, then it calls the shutdown function, and the program exits. Else, it calls the web_scraping function. This function calls another function with the name wishme. 

 

def main_window(): 
    global query 
    wishme() 
    while True: 
        if query != None: 
            if 'shutdown' in query or 'quit' in query or 'stop' in query or 'goodbye' in query: 
                shut_down() 
                break 
            else: 
                web_scraping(query) 
                query = None 

 

The wish me function 

This function checks the current time and greets users according to the hour of the day and it also updates the canvas. The contents in the text variable are passed to the ‘speak’ function. The ‘transition’ function is also invoked at the same time in order to show the movement effect of the bot image, while the bot is speaking. This synchronization is achieved through threads, which is why these functions are called inside threads. 

 

def wishme(): 
    hour = datetime.datetime.now().hour 
    if 0 <= hour < 12: 
        text = "Good Morning sir. I am Jarvis. How can I Serve you?" 
    elif 12 <= hour < 18: 
        text = "Good Afternoon sir. I am Jarvis. How can I Serve you?" 
    else: 
        text = "Good Evening sir. I am Jarvis. How can I Serve you?" 
    canvas2.create_text(10,10,anchor =NW , text=text,font=('Candara Light', -25,'bold italic'), fill="white",width=350) 
    p1=Thread(target=speak,args=(text,)) 
    p1.start() 
    p2 = Thread(target=transition) 
    p2.start() 

 

The speak function 

This function converts text to speech using pyttsx3 engine. 

def speak(text): 
    global flag 
    engine.say(text) 
    engine.runAndWait() 
    flag=False 

 

The transition functions 

The transition function is used to create the GIF image effect, by looping over images and updating them on canvas. The frames variable contains a list of ordered image names.  

 

def transition(): 
    global img1 
    global flag 
    global flag2 
    global frames 
    global canvas 
    local_flag = False 
    for k in range(0,5000): 
        for frame in frames: 
            if flag == False: 
                canvas.create_image(0, 0, image=img1, anchor=NW) 
                canvas.update() 
                flag = True 
                return 
            else: 
                canvas.create_image(0, 0, image=frame, anchor=NW) 
                canvas.update() 
                time.sleep(0.1) 

 

The web scraping function 

This function is the heart of this application. The question asked by the user is then searched on google using the ‘requests’ library of python. The ‘beautifulsoap’ library extracts the HTML content of the page and checks for answers in four particular divs. If the webpage does not contain any of the four divs, then it searches for answers on Wikipedia links, however, if that is also not successful, then the bot apologizes.  

 

def web_scraping(qs): 
    global flag2 
    global loading 
    URL = 'https://www.google.com/search?q=' + qs 
    print(URL) 
    page = requests.get(URL) 
    soup = BeautifulSoup(page.content, 'html.parser') 
    div0 = soup.find_all('div',class_="kvKEAb") 
    div1 = soup.find_all("div", class_="Ap5OSd") 
    div2 = soup.find_all("div", class_="nGphre") 
    div3  = soup.find_all("div", class_="BNeawe iBp4i AP7Wnd") 

    links = soup.findAll("a") 
    all_links = [] 
    for link in links: 
       link_href = link.get('href') 
       if "url?q=" in link_href and not "webcache" in link_href: 
           all_links.append((link.get('href').split("?q=")[1].split("&sa=U")[0])) 

    flag= False 
    for link in all_links: 
       if 'https://en.wikipedia.org/wiki/' in link: 
           wiki = link 
           flag = True 
           break
    if len(div0)!=0: 
        answer = div0[0].text 
    elif len(div1) != 0: 
       answer = div1[0].text+"\n"+div1[0].find_next_sibling("div").text 
    elif len(div2) != 0: 
       answer = div2[0].find_next("span").text+"\n"+div2[0].find_next("div",class_="kCrYT").text 
    elif len(div3)!=0: 
        answer = div3[1].text 
    elif flag==True: 
       page2 = requests.get(wiki) 
       soup = BeautifulSoup(page2.text, 'html.parser') 
       title = soup.select("#firstHeading")[0].text
       paragraphs = soup.select("p") 
       for para in paragraphs: 
           if bool(para.text.strip()): 
               answer = title + "\n" + para.text 
               break 
    else: 
        answer = "Sorry. I could not find the desired results"
    canvas2.create_text(10, 225, anchor=NW, text=answer, font=('Candara Light', -25,'bold italic'),fill="white", width=350) 
    flag2 = False 
    loading.destroy()
    p1=Thread(target=speak,args=(answer,)) 
    p1.start() 
    p2 = Thread(target=transition) 
    p2.start() 

 

The take command function 

This function is invoked when the user clicks the green button to ask any question. The speech recognition library listens for 5 seconds and converts the audio input to text using google recognize API. 

 

def takecommand(): 
    global loading 
    global flag 
    global flag2 
    global canvas2 
    global query 
    global img4 
    if flag2 == False: 
        canvas2.delete("all") 
        canvas2.create_image(0,0, image=img4, anchor="nw")  
    speak("I am listening.") 
    flag= True 
    r = sr.Recognizer() 
    r.dynamic_energy_threshold = True 
    r.dynamic_energy_adjustment_ratio = 1.5 
    #r.energy_threshold = 4000 
    with sr.Microphone() as source: 
        print("Listening...") 
        #r.pause_threshold = 1 
        audio = r.listen(source,timeout=5,phrase_time_limit=5) 
        #audio = r.listen(source) 
 
    try: 
        print("Recognizing..") 
        query = r.recognize_google(audio, language='en-in') 
        print(f"user Said :{query}\n") 
        query = query.lower() 
        canvas2.create_text(490, 120, anchor=NE, justify = RIGHT ,text=query, font=('fixedsys', -30),fill="white", width=350) 
        global img3 
        loading = Label(root, image=img3, bd=0) 
        loading.place(x=900, y=622) 
 
    except Exception as e: 
        print(e) 
        speak("Say that again please") 
        return "None"

 

The shutdown function 

This function farewells the user and destroys the root window in order to exit the program. 

def shut_down(): 
    p1=Thread(target=speak,args=("Shutting down. Thankyou For Using Our Sevice. Take Care, Good Bye.",)) 
    p1.start() 
    p2 = Thread(target=transition) 
    p2.start() 
    time.sleep(7) 
   root.destroy()

 

Conclusion 

It is time to wrap up, I hope you enjoyed our little application. This is the power of Python, you can create small attractive applications in no time with a little amount of code. Keep following us for more cool python projects! 

 

Code - CTA

 

Data Science Dojo
Austin Chia
| September 22

Data science tools are becoming increasingly popular as the demand for data scientists increases. However, with so many different tools, knowing which ones to learn can be challenging

In this blog post, we will discuss the top 7 data science tools that you must learn. These tools will help you analyze and understand data better, which is essential for any data scientist.

So, without further ado, let’s get started!

List of 7 data science tools 

There are many tools a data scientist must learn, but these are the top 7:

Top 7 data science tools - Data Science Dojo
Top 7 data science tools you must learn
  • Python
  • R Programming
  • SQL
  • Java
  • Apache Spark
  • Tensorflow
  • Git

And now, let me share about each of them in greater detail!

1. Python

Python is a popular programming language that is widely used in data science. It is easy to learn and has many libraries that can be used to analyze data, machine learning, and deep learning.

It has many features that make it attractive for data science: An intuitive syntax, rich libraries, and an active community.

Python is also one of the most popular languages on GitHub, a platform where developers share their code.

Therefore, if you want to learn data science, you must learn Python!

There are several ways you can learn Python:

  • Take an online course: There are many online courses that you can take to learn Python. I recommend taking several introductory courses to familiarize yourself with the basic concepts.

 

PRO TIP: Join our 5-day instructor-led Python for Data Science training to enhance your deep learning skills.

 

  • Read a book: You can also pick up a guidebook to learning data science. They’re usually highly condensed with all the information you need to get started with Python programming.
  • Join a Boot Camp: Boot camps are intense, immersive programs that will teach you Python in a short amount of time.

 

Whichever way you learn Python, make sure you make an effort to master the language. It will be one of the essential tools for your data science career.

2. R Programming

R is another popular programming language that is highly used among statisticians and data scientists. They typically use R for statistical analysis, data visualization, and machine learning.

R has many features that make it attractive for data science:

  • A wide range of packages
  • An active community
  • Great tools for data visualization (ggplot2)

These features make it perfect for scientific research!

In my experience with using R as a healthcare data analyst and data scientist, I enjoyed using packages like ggplot2 and tidyverse to work on healthcare and biological data too!

If you’re going to learn data science with a strong focus on statistics, then you need to learn R.

To learn R, consider working on a data mining project or taking a certificate in data analytics.

 

3. SQL

SQL (Structured Query Language) is a database query language used to store, manipulate, and retrieve data from data sources. It is an essential tool for data scientists because it allows them to work with databases.

SQL has many features that make it attractive for data science: it is easy to learn, can be used to query large databases, and is widely used in industry.

If you want to learn data science involving big data sets, then you need to learn SQL. SQL is also commonly used among data analysts if that’s a career you’re also considering exploring.

There are several ways you can learn SQL:

  • Take an online course: There are plenty of SQL courses online. I’d pick one or two of them to start with
  • Work on a simple SQL project
  • Watch YouTube tutorials
  • Do SQL coding questions

 

4. Java

Java is another programming language to learn as a data scientist. Java can be used for data processing, analysis, and NLP (Natural Language Processing).

Java has many features that make it attractive for data science: it is easy to learn, can be used to develop scalable applications, and has a wide range of frameworks commonly used in data science. Some popular frameworks include Hadoop and Kafka.

There are several ways you can learn Java:

 

5. Apache Spark

Apache Spark is a powerful big data processing tool that is used for data analysis, machine learning, and streaming. It is an open-source project that was originally developed at UC Berkeley’s AMPLab.

Apache Spark is known for its uses in large-scale data analytics, where data scientists can run machine learning on single-node clusters and machines.

Spark has many features made for data science:

  • It can process large datasets quickly
  • It supports multiple programming languages
  • It has high scalability
  • It has a wide range of libraries

If you want to learn big data science, then Apache Spark is a must-learn. Consider taking an online course or watching a webinar on big data to get started.

 

6. Tensorflow

TensorFlow is a powerful toolkit for machine learning developed by Google. It allows you to build and train complex models quickly.

Some ways TensorFlow is useful for data science:

  • Provides a platform for data automation
  • Model monitoring
  • Model training

Many data scientists use TensorFlow with Python to develop machine learning models. TensorFlow helps them to build complex models quickly and easily.

If you’re interested to learn TensorFlow, do consider these ways:

  • Read the official documentation
  • Complete online courses
  • Attend a TensorFlow meetup

However, to learn and practice your Tensorflow skills, you’ll need to pick up decent deep learning hardware to support the running of your algorithms.

 

7. Git

Git is a version control system used to track code changes. It is an essential tool for data scientists because it allows them to work on projects collaboratively and keep track of their work.

Git is useful in data science for:

If you’re planning to enter data science, Git is a must-know tool! Since you’ll be coding a lot in Python/R/Java, you’ll want to master Git to work with your team well in a collaborative coding environment.

Git is also an essential part of using GitHub, a code repository platform used by many data scientists.

To learn Git, I’d recommend just watching simple tutorials on YouTube.

Final thoughts

And these are the top seven data science tools that you must learn!

The most important thing is to get started and keep upskilling yourself! There is no one-size-fits-all solution in data science, so find the tools that work best for you and your team and start learning.

I hope this blog post has been helpful in your journey to becoming a data scientist. Happy learning!

 

Data Science Dojo
Ali Mohsin
| August 3

Data Science Dojo has launched  Jupyter Hub for Deep Learning using Python offering to the Azure Marketplace with pre-installed Deep Learning libraries and pre-cloned GitHub repositories of famous Deep Learning books and collections which enables the learner to run the example codes provided.

What is Deep Learning?

Deep learning is a subfield of machine learning and artificial intelligence (AI) that mimics how people gain specific types of knowledge. Deep learning algorithms are incredibly complex and the structure of these algorithms, where each neuron is connected to the other and transmits information, is quite similar to that of the nervous system.

Also, there are different types of neural networks to address specific problems or datasets, for example, Convolutional neural networks (CNNs) and Recurrent neural networks (RNNs).

While in the field of Data Science, which also encompasses statistics and predictive modeling, deep learning contains a key component. This procedure is made quicker and easier by deep learning, which is highly helpful for data scientists who are tasked with gathering, processing, and interpreting vast amounts of data.

Deep Learning using Python

Python, a high-level programming language that was created in 1991 and has seen a rise in popularity, is compatible with deep learning, which has contributed to its development. While several languages, including C++, Java, and LISP, can be used with deep learning, Python continues to be the preferred option for millions of developers worldwide.

Additionally, data is the essential component in all deep learning algorithms and applications, both as training data and as input. Python is a great tool to employ for managing large volumes of data for training your deep learning system, inputting input, or even making sense of its output because it is primarily used for data management, processing, and forecasting.

PRO TIP: Join our 5-day instructor-led Python for Data Science training to enhance your deep learning skills.

deep learning

Challenges for individuals

Individuals who want to upgrade their path from Machine Learning to Deep Learning and want to start with it usually lack the resources to gain hands-on experience with Deep Learning. A beginner in Deep Learning also faces compatibility issues while installing libraries.

What we provide

Jupyter Hub for Deep Learning using Python solves all the challenges by providing you an effortless coding environment in the cloud with pre-installed Deep Learning python libraries which reduces the burden of installation and maintenance of tasks hence solving the compatibility issues for an individual.

Moreover, this offer provides the user with repositories of famous authors and books on Deep Learning which contain chapter-wise notebooks with some exercises which serve as a learning resource for a user in gaining hands-on experience with Deep Learning.

The heavy computations required for Deep Learning applications are not performed on the user’s local machine. Instead, they are performed in the Azure cloud, which increases responsiveness and processing speed.

Listed below are the pre-installed python libraries related to Deep learning and the sources of repositories of Deep Learning books provided by this offer:

Python libraries:

  • NumPy
  • Matplotlib
  • Pandas
  • Seaborn
  • TensorFlow
  • Tflearn
  • PyTorch
  • Keras
  • Scikit Learn
  • Lasagne
  • Leather
  • Theano
  • D2L
  • OpenCV

Repositories:

  • GitHub repository of book Deep Learning with Python 2nd Edition, by author François Chollet.
  • GitHub repository of book Hands-on Deep Learning Algorithms with Python, by author Sudharsan Ravichandran.
  • GitHub repository of book Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow, by author Geron Aurelien.
  • GitHub repository of collection on Deep Learning Models, by author Sebastian Raschka.

Conclusion:

Jupyter Hub for Deep Learning using Python provides an in-browser coding environment with just a single click, hence providing ease of installation. Through this offer, a user can work on a variety of Deep Learning applications self-driving cars, healthcare, fraud detection, language translations, auto-completion of sentences, photo descriptions, image coloring and captioning, object detection, and localization.

This Jupyter Hub for Deep Learning instance is ideal to learn more about Deep Learning without the need to worry about configurations and computing resources.

The heavy resource requirement to deal with large datasets and perform the extensive model training and analysis for these applications is no longer an issue as heavy computations are now performed on Microsoft Azure which increases processing speed.

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data.

We are therefore adding a free Jupyter Notebook Environment dedicated specifically to Deep Learning using Python. Install the Jupyter Hub offer now from the Azure Marketplace, your ideal companion in your journey to learn data science!

Try Now!

Syed Saad Peerzada
Syed Saad Peerzada
| August 4

What is web scraping?

Web scraping is the act of extracting the content and data from a website. The vast amount of data available on the internet is not open and available to download. As a result, ethical web scraping is the most effective technique to collect this data. There is also a debate about the legality of web scraping as the content may get stolen or the website can crash as a result of web scraping.

Ethical Web Scraping is the act of harvesting data legally by following ethical rules about web scraping. There are certain rules in ethical web scraping that when followed ensure trust between the website owner and web scraper.

Web scraping using Python

In Python, a learner can write a small piece of code to do large tasks. Since web scraping is used to save time, a small code written in Python can save a lot of time. Also, Python is simple and easy to understand and provides an extensive set of libraries for web scraping and further manipulation required on extracted data.

PRO TIP: Join our 5-day instructor-led Python for Data Science training to enhance your web scraping skills.

Challenges for individuals

Individuals who are new to web scraping and wish to flourish in their field usually lack the necessary computing and learning resources to obtain hands-on expertise. Also, they may face compatibility issues when installing libraries.

What we provide

With just a single click, Jupyter Hub for Ethical Web Scraping using Python comes with pre-installed Web Scraping python libraries, which gives the learner an effortless coding environment in the Azure cloud and reduces the burden of installation. Moreover, this offer provides the learner with a repository of the famous book on web scraping which contains chapter-wise notebooks which serve as a learning resource for a user in gaining hands-on experience with web scraping.

Through this offer, a learner can collect data from various sources legally by following the best practices for ethical web scraping mentioned in the latter section of this blog. Once the data is collected, it can be further analyzed to get valuable insights into almost everything while all the heavy computations are performed on Microsoft Azure hence saving the user from the trouble of running high computations on the local machine.

Python libraries:

Listed below are the pre-installed web scraping python libraries and the sources of repositories of web scraping book provided by this offer:

  •          Pandas
  •          NumPy
  •          Scikit-learn
  •          Beautifulsoup4
  •          lxml
  •          MechanicalSoup
  •          Requests
  •          Scrapy
  •          Selenium
  •          urllib

Repository:

  •          GitHub repository of book Web Scraping with Python 2nd Edition,
    by author Ryan Mitchell.

Best practices for ethical web scraping

Globally, there is a debate about whether web scraping is an ethical concept or not. The reason it is unethical is that when a website is queried repeatedly by the same user (in this case bot), too many requests land on the server simultaneously and all resources of the server may be consumed in generating responses for each request, preventing it from responding to other legitimate users.

In this way, the server denies responses to any further users, commonly known as a Denial of Service (DoS) attack.

Below are the best practices for ethical web scraping, and compliance with these will allow a web scraper to work ethically.

1.   Check out for ROBOTS.TXT

Robots.txt file, also known as the Robots Exclusion Standard, is used to inform the web scrapers if the website can be crawled or not, if yes then how to index the website. A legitimate web scraper is expected to respect the instructions in this file and not disobey the website owner’s allowed instructions.

2.   Check for website APIs

An ethical web scraper is expected to first look for the public API of the website in question instead of scraping it all together. Many website owners provide public API access which can be used by anyone looking to gain from the information available on the website. Provision of public API works in the best interests of both the ethical scrapper as well as the website owner, avoiding web scraping altogether.

3.   Avoid repeated requests

Vigorous scraping can occasionally cause functionality issues, resulting in a poor user experience for humans. As a result, it is always advised to scrape during off-peak hours. An ethical web scraper is expected to delay recurrent requests to avoid a DoS attack.

4.   Provide your identity

It is always a good idea to take responsibility for one’s actions. An ethical web scraper never hides his or her identity and provides it in a user-agent string. Not only does this make the intentions of the scraper clear but also provides a means of contact for any questions or concerns of the website owner.

5.   Avoid fake ownership

The content scraped through web scraper should always be respected and never passed on under the fake information of scraper as the author. This act can be regarded as highly unethical as well as illegal since the website owner may file a copyright claim. It also damages the reputation of genuine web scrapers and hurts the trust of the website owner.

6.  Ask for permission

Since the website information belongs to the owner, one should never presume it to be free and ask politely to use it for their means. An ethical web scraper always seeks permission from the website owner to avoid any future problems. The website owner should be given the choice of whether she agrees to scrape the data.

 7.  Give due credit

To encourage the website owner as a token of thanks, the web scraper should give due credit wherever possible. This can be done in many ways such as providing a link to the original website on any blog, article, or social media post by generating traffic for the original website.

Ethical web scraping

Conclusion

Ethical web scraping is a two-way street in which the website owner should be mindful of the global availability of the data, similarly, the scraper should not harm the website in any way and also first seek permission from the website owner. If a web scraper abides by the above-mentioned practices, I.e., he/she works ethically, the web owner may not only allow scraping his or her website but also provide helpful means to the scraper in the form of Meta data or a public API.

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free Jupyter Notebook Environment dedicated specifically for Ethical Web Scraping using Python. Install the Jupyter Hub offer now from the Azure Marketplace by Data Science Dojo, your ideal companion in your journey to learn data science!

Try now - CTA

Related Topics

Statistics
Resources
Programming
Machine Learning
LLM
Generative AI
Data Visualization
Data Security
Data Science
Data Engineering
Data Analytics
Computer Vision
Career
Artificial Intelligence