
Natural Language Processing – Tasks and techniques

Data Science Dojo
Muhammad Fahad Alam

November 7

This blog discusses the different tasks and techniques used in natural language processing. We will use Python code to demonstrate what each task does and how it works, and discuss why these tasks and techniques are essential for natural language processing.

 

Introduction

According to a survey, only 32 percent of business data is put to work, while 68 percent goes unleveraged. Most of this data is unstructured: estimates put 80 to 90 percent of business data in unstructured form, such as emails, reports, social media posts, websites, and documents. NLP techniques make it possible for machines to manage and analyze unstructured data accurately and quickly.

Computers can now understand, manipulate, and interpret human language. Businesses use NLP to improve customer experience, listen to customer feedback, and find market gaps. Almost 50% of companies today use NLP applications, and 25% plan to do so in 12 months.   

The future of customer care is NLP. Customers prefer mobile messaging and chatbots over the legacy voice channel; these channels are reportedly four times more accurate. According to the IBM market survey, 52% of global IT professionals reported using or planning to use NLP to improve customer experience. Chatbots were projected to resolve 80% of routine tasks and customer questions with a 90% success rate by 2022. Estimates show that using NLP in chatbots will save companies USD 8 billion annually.

The NLP market was at 3 billion US dollars in 2017 and is predicted to rise to 43 billion US dollars in 2025, around 14 times higher. 

 

Natural Language Processing (NLP)  

Natural language processing is a branch of artificial intelligence that enables computers to analyze, understand, and derive meaning from human language using machine learning, and to respond to it. NLP combines computational linguistics with artificial intelligence and machine learning to create an intelligent system capable of understanding and responding to text or voice data the same way humans do.

 

NLP analyzes the syntax and semantics of the text to understand the meaning and structure of human language. Then it transforms this linguistic knowledge into a machine-learning algorithm to solve real-world problems and perform specific tasks.   

Natural language is challenging to comprehend, which makes NLP a challenging task. Mastering a language is easy for humans, but implementing NLP becomes difficult for machines because of the ambiguity and imprecision of natural language. 

 

NLP requires syntactic and semantic analysis to convert human language into a machine-readable form that can be processed and interpreted. 

 

Syntactic analysis  

Syntactic analysis is the process of analyzing language against its formal grammatical rules. It is also known as syntax analysis or parsing. Formal grammatical rules apply to groups of words rather than to a single word. Syntactic analysis takes text as input, checks that its syntax is correct, and creates a structural representation of the input, typically a parse tree. A syntactically correct sentence does not necessarily make sense; it also needs to be semantically correct.
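To make the idea of checking a sentence against formal grammar rules concrete, here is a minimal, library-free sketch. It assumes a toy grammar in Chomsky normal form and uses the CYK algorithm as a recognizer. The grammar, the word list, and the function name is_grammatical are all made up for illustration; a real parser uses a far larger grammar or a trained model.

```python
from itertools import product

# Toy grammar in Chomsky normal form (illustrative assumption, not from a real parser).
# Binary rules map a pair of child symbols to the parent symbols they can form.
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
# Lexical rules map a word to its possible categories.
WORD_CATEGORIES = {
    "the": {"Det"},
    "dog": {"N"},
    "cat": {"N"},
    "idea": {"N"},
    "chased": {"V"},
    "ate": {"V"},
}


def is_grammatical(sentence: str) -> bool:
    """Return True if the CYK algorithm can derive S for the whole sentence."""
    words = sentence.lower().split()
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] holds the symbols that can span words[i..j] inclusive
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        chart[i][i] = set(WORD_CATEGORIES.get(word, set()))
    for span in range(2, n + 1):
        for start in range(n - span + 1):
            end = start + span - 1
            for split in range(start, end):
                pairs = product(chart[start][split], chart[split + 1][end])
                for left, right in pairs:
                    chart[start][end] |= BINARY_RULES.get((left, right), set())
    return "S" in chart[0][n - 1]


print(is_grammatical("the dog chased the cat"))  # well-formed word order
print(is_grammatical("dog the chased cat the"))  # scrambled word order
print(is_grammatical("the idea ate the dog"))    # well-formed, yet not meaningful
```

The last sentence passes the syntax check even though it makes little sense, which is why semantic analysis is needed as a separate step.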

 

Semantic analysis  

Semantic analysis is the process of figuring out the meaning of the text. It enables computers to interpret words by analyzing sentence structure, the relationships between the individual words of a sentence, and other aspects, in order to discover the meaning and topic of the text. Because of language’s ambiguous and polysemic nature, semantic analysis is a particularly challenging area of NLP.

 

NLP tasks and techniques: 

Before proceeding further, run the code block below to install all the dependencies.

 

!pip install -U spacy
!python -m spacy download en_core_web_sm
!pip install nltk
!pip install prettytable

Here are some common tasks performed in syntactic and semantic analysis:

 

Tokenization  

Tokenization is a common task in NLP. It separates natural language text into smaller units called tokens. For example, sentence tokenization splits a paragraph into sentences, and word tokenization splits a sentence into words.
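As a rough illustration of the two levels of tokenization, the sketch below uses only Python's re module. The function names simple_sentence_tokenize and simple_word_tokenize are hypothetical, and the rules are deliberately naive (abbreviations such as "Dr." would be split incorrectly); library tokenizers handle many more cases.

```python
import re


def simple_sentence_tokenize(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def simple_word_tokenize(sentence: str) -> list[str]:
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


paragraph = "NLP is fun. Tokenization comes first!"
for sent in simple_sentence_tokenize(paragraph):
    print(simple_word_tokenize(sent))
```

Punctuation marks come out as separate tokens, which matches what the spaCy example does with the final period.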

 

The code below shows an example of word tokenization using spaCy.   

 

Code:  

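import spacy

# Load the small English pipeline installed earlier
nlp = spacy.load("en_core_web_sm")
doc = nlp("Data Science Dojo is the leading platform providing data science training.")
# Iterating over a Doc yields one token at a time
for token in doc:
    print(token.text)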

 

Output: 

 

Data
Science
Dojo
is
the
leading
platform
providing
data
science
training
.


Part-of-speech tagging  

Part-of-speech (POS) tagging, or grammatical tagging, labels each word with the appropriate part of speech based on its definition and context. POS tagging helps create a parse tree, which in turn helps in understanding word relationships. It also helps in named entity recognition: most named entities are nouns, so knowing the nouns makes them easier to identify.
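To show why context matters for tagging, here is a toy tagger written in plain Python. The lexicon, the rule table, and the name toy_pos_tag are assumptions made for this sketch; trained taggers such as spaCy's learn these decisions from annotated data rather than from hand-written rules.

```python
# Toy lexicon listing the possible universal POS tags per word (hypothetical).
TAG_LEXICON = {
    "i": ["PRON"],
    "want": ["VERB"],
    "read": ["VERB"],
    "to": ["PART"],
    "the": ["DET"],
    "a": ["DET"],
    "book": ["NOUN", "VERB"],  # ambiguous without context
    "flight": ["NOUN"],
}
# Contextual rule: tag of the previous token -> preferred tag for the next word.
PREFERRED_AFTER = {"DET": "NOUN", "PART": "VERB"}


def toy_pos_tag(tokens):
    """Tag tokens left to right, using the previous tag to resolve ambiguity."""
    tags = []
    for token in tokens:
        # Unknown words default to NOUN in this sketch
        candidates = TAG_LEXICON.get(token.lower(), ["NOUN"])
        previous = tags[-1] if tags else None
        preferred = PREFERRED_AFTER.get(previous)
        tags.append(preferred if preferred in candidates else candidates[0])
    return list(zip(tokens, tags))


print(toy_pos_tag("I want to book a flight".split()))
print(toy_pos_tag("I read the book".split()))
```

The word "book" is tagged as a verb after "to" and as a noun after "the", which is the kind of context-dependent decision a POS tagger has to make.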

In the code below, we use the pos_ attribute of each token to get its part of speech from the universal POS tag set, and the tag_ attribute to get the fine-grained tag.

 

Code:  

import spacy
from prettytable import PrettyTable

table = PrettyTable(['Token', 'Part of speech', 'Tag'])
nlp = spacy.load("en_core_web_sm")
doc = nlp("Data Science Dojo is the leading platform providing data science training.")
# pos_ is the universal (coarse) tag, tag_ is the fine-grained tag
for token in doc:
    table.add_row([token.text, token.pos_, token.tag_])
print(table)

 

Output:    

[Image: table of tokens with their part-of-speech tags]

Demo: 

Try it yourself with this Analyze Text Demo. 


 

Dependency and Constituency parsing

Dependency parsing analyzes the grammatical structure of a sentence to find related words and the relationships between them. Each relationship has one head and one dependent, and a label describing the nature of the dependency is assigned between the head and the dependent.

Constituency parsing identifies the phrase structure of a sentence according to a phrase structure grammar, so that the entire syntactic structure can be visualized.
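The head, dependent, and label triples described above can be held in a plain data structure. The sketch below assumes a hand-annotated sentence, "She reads long books", with labels written in the Universal Dependencies naming style; the arcs are typed in by hand for illustration and are not the output of a parser. The helper names build_tree and print_tree are hypothetical.

```python
from collections import defaultdict

# Hand-annotated arcs for "She reads long books" as (head, label, dependent).
# This annotation is an illustrative assumption, not parser output.
ARCS = [
    ("ROOT", "root", "reads"),
    ("reads", "nsubj", "She"),
    ("reads", "obj", "books"),
    ("books", "amod", "long"),
]


def build_tree(arcs):
    """Group (label, dependent) pairs under their head."""
    children = defaultdict(list)
    for head, label, dependent in arcs:
        children[head].append((label, dependent))
    return children


def print_tree(children, head="ROOT", depth=0):
    """Print each dependent indented below its head."""
    for label, dependent in children.get(head, []):
        print("  " * depth + f"{dependent} <-{label}- {head}")
        print_tree(children, dependent, depth + 1)


tree = build_tree(ARCS)
print_tree(tree)
```

Every word appears exactly once as a dependent, so the arcs form a tree rooted at the main verb.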

In the code below, we create a dependency tree using spaCy's displacy visualizer.

 

Code:  

 

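import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Data Science Dojo is the leading platform providing data science training.")
# Draws the tree inline in a Jupyter notebook; elsewhere it returns the markup as a string
displacy.render(doc, style="dep")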

 

Output:  

[Image: dependency tree of the example sentence]

 

Demo:  

Try it yourself with this Analyze Text Demo. 

 

Lemmatization and stemming  

We use inflected forms of words when we speak or write. These inflected forms are created by adding prefixes or suffixes to the root form. Lemmatization and stemming group the inflected forms of a word under a single root form. In this way, all the variants of a word are linked to one form, which is simpler for the computer to analyze.

 

The root form is called a lemma in lemmatization and a stem in stemming. Lemmatization and stemming both group inflected forms, but they work differently. Lemmatization considers the word and its context in the sentence, while stemming considers only the single word. So, we consider POS tags in lemmatization but not in stemming. That is why a lemma is an actual dictionary word, but a stem might not be.
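The difference can be sketched without any library. The code below assumes a tiny suffix list and a tiny lemma table keyed by word and POS tag; both are invented for this example, as are the names toy_stem and toy_lemmatize. It is not how the Porter stemmer or spaCy's lemmatizer work internally; it only shows that stemming ignores context while lemmatization can use the POS tag.

```python
# Toy suffix list and lemma table, both hypothetical and far from complete.
SUFFIXES = ("ing", "ies", "ed", "es", "s")
LEMMA_TABLE = {
    ("studies", "NOUN"): "study",
    ("studies", "VERB"): "study",
    ("leaves", "NOUN"): "leaf",
    ("leaves", "VERB"): "leave",
    ("better", "ADJ"): "good",
}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix; no dictionary, no context."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str, pos: str) -> str:
    """Look up (word, POS) in the table; fall back to the lowercased word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


for word, pos in [("leaves", "NOUN"), ("leaves", "VERB"), ("studies", "VERB")]:
    print(word, pos, "stem:", toy_stem(word), "lemma:", toy_lemmatize(word, pos))
```

Here "leaves" always stems to "leav", which is not a dictionary word, while its lemma is "leaf" or "leave" depending on the POS tag.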

Now let us apply lemmatization using spaCy.

Code:    

 

import spacy

# The parser and NER components are not needed for lemmatization
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
doc = nlp("Data Science Dojo is the leading platform providing data science training.")
# lemma_ holds the base form of each token
lemmatized = [token.lemma_ for token in doc]
print("Original: \n", doc)
print("\nAfter Lemmatization: \n", " ".join(lemmatized))

 

Output:   

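Original:

 Data Science Dojo is the leading platform providing data science training.

After Lemmatization:

 Data Science Dojo is the lead platform to provide datum science training.

The exact lemmas depend on the spaCy model version; for example, spaCy normally returns "be" as the lemma of "is".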

 

Unfortunately, spaCy does not provide a stemmer.

Let us use the Porter stemmer from NLTK to see how stemming works.

 

Code: 

import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Tokenizer data used by word_tokenize (recent NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

ps = PorterStemmer()
sentence = "Data Science Dojo is the leading platform providing data science training."
words = word_tokenize(sentence)
# Stem each token on its own, without looking at its context
stemmed = [ps.stem(token) for token in words]
print("Original: \n", " ".join(words))
print("\nAfter Stemming: \n", " ".join(stemmed))

 

Output:    

Original:  

 Data Science Dojo is the leading platform providing data science training . 

After Stemming:  

 data scienc dojo is the lead platform provid data scienc train . 

 

Stop word removal  

Stop words are words that occur very frequently in a natural language, such as "is" and "the". They usually contribute little to text analysis and many NLP tasks, so we often remove them to reduce noise in the text.
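At its core, a stop list is just a set of words to filter out. The sketch below assumes a tiny hand-picked list and a hypothetical helper named remove_stop_words; real stop lists contain hundreds of entries. It also shows that which words belong on the list depends on the task, since removing a negation can change what the text says.

```python
# A tiny hand-picked stop list (hypothetical; real lists hold hundreds of words).
STOP_WORDS = {"a", "an", "the", "is", "was", "of", "to", "and", "not"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens whose lowercase form is not in the stop list."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = ["The", "service", "was", "not", "good"]
print(remove_stop_words(tokens))
# Taking "not" off the stop list preserves the negation.
print(remove_stop_words(tokens, STOP_WORDS - {"not"}))
```

For a task such as sentiment analysis, the second variant is usually the safer choice, because "not good" and "good" mean opposite things.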

 

Code: 

 

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Data Science Dojo is the leading platform providing data science training.")
token_list = [token.text for token in doc]
# Keep only the tokens that spaCy does not flag as stop words
filtered_sentence = [token.text for token in doc if not token.is_stop]
print("Tokens:\n", token_list)
print("\nAfter stop word removal:\n", filtered_sentence)

 

Output: 

 

Tokens: 

['Data', 'Science', 'Dojo', 'is', 'the', 'leading', 'platform', 'providing', 'data', 'science', 'training', '.'] 

 

After stop word removal: 

['Data', 'Science', 'Dojo', 'leading', 'platform', 'providing', 'data', 'science', 'training', '.'] 

 

Demo: 

Try it yourself with this Cleanse Stop Words Demo. 


 

Named entity recognition  

Named entity recognition is an NLP technique that extracts named entities from the text and categorizes them into semantic types like organization, people, quantity, percentage, location, time, etc. Identifying named entities helps identify the critical element in the text, which can help sort the unstructured data and find valuable information.   
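To illustrate what "extract and categorize" means in practice, here is a rule-based sketch that uses a small name list and regular expressions. This is a simplification chosen for the example: the name list, the patterns, and the function extract_entities are all assumptions, and statistical NER models such as spaCy's learn to recognize entities from annotated data instead of fixed rules.

```python
import re

# Hypothetical name list and patterns, for illustration only.
ORGANIZATIONS = ["Data Science Dojo", "University of New Mexico"]
PATTERNS = [
    ("PERCENT", re.compile(r"\b\d+(?:\.\d+)?%")),
    ("DATE", re.compile(r"\b(?:19|20)\d{2}\b")),  # four-digit years only
]


def extract_entities(text: str):
    """Return (entity text, start, end, label) tuples sorted by position."""
    entities = []
    for name in ORGANIZATIONS:
        for match in re.finditer(re.escape(name), text):
            entities.append((match.group(), match.start(), match.end(), "ORG"))
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            entities.append((match.group(), match.start(), match.end(), label))
    return sorted(entities, key=lambda entity: entity[1])


sample = "Data Science Dojo was founded in 2013."
for entity in extract_entities(sample):
    print(entity)
```

Each result carries the entity text, its character offsets, and a label, which is the same shape of information that spaCy exposes through ent.text, ent.start_char, ent.end_char, and ent.label_.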

 

Code: 

 

import spacy
from spacy import displacy
from prettytable import PrettyTable

nlp = spacy.load("en_core_web_sm")
doc = nlp("Data Science Dojo was founded in 2013 but it was a free Meetup group long before the official launch. With the aim to bring the knowledge of data science to everyone, we started hosting short Bootcamps with the most comprehensive curriculum. In 2019, the University of New Mexico (UNM) added our Data Science Bootcamp to their continuing education department. Since then, we've launched various other trainings such as Python for Data Science, Data Science for Managers and Business Leaders. So far, we have provided our services to more than 10,000 individuals and over 2000 organizations.")
table = PrettyTable(["Entity", "Start Position", "End Position", "Label"])
# doc.ents holds the entity spans found by the pipeline
for ent in doc.ents:
    table.add_row([ent.text, ent.start_char, ent.end_char, ent.label_])
print(table)
# Highlights the entities inline in a Jupyter notebook
displacy.render(doc, style="ent")

 

Output:   

 

[Image: table of the extracted entities with their positions and labels]

Visualization:   

 

[Image: displacy visualization of the entities highlighted in the text]

 

Demo: 

Try it yourself with this Text Entity Extractor Demo. 

 


 

Sentiment analysis 

Sentiment analysis, also referred to as opinion mining, uses natural language processing to find and extract sentiments from the text. It determines whether the data is positive, negative, or neutral.  
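One simple way to classify text as positive, negative, or neutral is to sum word scores from a sentiment lexicon. The sketch below assumes a tiny hand-made lexicon and a basic negation rule; the word scores and the function name sentiment are invented for illustration, and production systems typically rely on much larger lexicons or trained models.

```python
import re

# Tiny hand-made sentiment lexicon (hypothetical; real lexicons are much larger).
SENTIMENT_LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}
NEGATIONS = {"not", "never", "no"}


def sentiment(text: str) -> str:
    """Sum word scores, flipping the sign of a word that follows a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        value = SENTIMENT_LEXICON.get(token, 0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        score += value
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for review in ["I love this course", "The support was not good", "It arrived on Monday"]:
    print(review, "->", sentiment(review))
```

The negation rule only looks one word back, so phrases such as "not very good" would still be scored as positive in this sketch.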

 

Some of the real-world applications of sentiment analysis are:  

  • Customer support  
  • Customer feedback  
  • Brand monitoring  
  • Product analysis  
  • Market research  

 

Demo: 

Try it yourself with this Opinion Mining Demo. 

 


Conclusion:  

We have discussed natural language processing and the common tasks it involves. We also saw how to perform these tasks with spaCy and NLTK and why they are essential in natural language processing.

Full Code Available 

 We know about the different tasks and techniques we perform in natural language processing, but we have yet to discuss the applications of natural language processing. For that, you can follow this blog. 

Read more about: 

Blog: NLP Applications

 

Upgrade your data science skillset with our Python for Data Science and Data Science Bootcamp training!  

 

Written by Muhammad Fahad Alam