Natural Language Processing
The part you would want to see 😎
We can’t deny that machines are great at dealing with database tables and spreadsheets. But did you know that, by common estimates, almost 80% of the data we need to make decisions and find solutions is data we can’t take advantage of (for now)? That is because it is unstructured.
The top uses of NLP are 🤖 Chatbots (to keep you company or answer your queries), 📈 Sentiment Analysis (to evaluate customer satisfaction so that companies can improve their business), 🎤 Voice and Speech Recognition (Siri, Cortana, Google Home), and 🈯 Machine Translation (Google Translate).
Natural Language Processing (NLP) is a subfield of AI. NLP steps in to fix this problem by making machines understand and process human languages.
Okay? Cool. Now, how does this work?
To get to the point where machines can understand what we are saying, we need to teach our computer the most basic concepts of written language, and then we can move up from there.
Let’s dive into building an NLP Pipeline
Step 1: Sentence Segmentation
Segmentation is the process of splitting a text into sentences. In other words, deciding where sentences begin and end.
It will be easier if we assume that each sentence is a separate thought or idea. Why? Because writing a program that understands a single sentence is much easier than writing one that understands a whole paragraph.
Text = “Hello world. Let’s try segmenting.”
Segment 1: “Hello world.”
Segment 2: “Let’s try segmenting.”
Keep in mind that segmentation can be as simple as what we just saw: split whenever you see a sentence-ending punctuation mark. But modern NLP pipelines use more advanced techniques that still work when the document isn’t that clean.
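To make the simple punctuation rule concrete, here is a minimal sketch in plain Python. The function name split_sentences is made up for this example, and the rule is deliberately naive: it assumes every ".", "!" or "?" followed by white space ends a sentence.

```python
import re

def split_sentences(text):
    """Naive segmenter: split after '.', '!' or '?' when white space follows."""
    # The lookbehind (?<=[.!?]) keeps the punctuation with the sentence it ends.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("Hello world. Let's try segmenting."))
# ['Hello world.', "Let's try segmenting."]
```

A rule like this breaks on abbreviations such as "Dr. Smith", which is exactly where the more advanced techniques come in.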
Step 2: Word Tokenization
Don’t you think it is fair that, after breaking the paragraph into sentences, we break those sentences into words, or tokens?
Check this out, from our previous example:
Segment 1: “Hello world.”
After Tokenization: “Hello”, “world”, “.”
Easy: split whenever you see white space. Oh yeah, you must have noticed that I also listed “.” as a token, even though no space comes before it. Of course I did: punctuation has a meaning too, so it gets split off as its own token.
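Here is a small sketch of that idea using only Python’s standard library. The tokenize function and its regular expression are assumptions made for illustration, not the tokenizer of any particular library: words stay whole (including an inner apostrophe, as in “Let’s”) and every punctuation mark becomes its own token.

```python
import re

def tokenize(sentence):
    # \w+(?:['’]\w+)? matches a word, optionally with an inner apostrophe;
    # [^\w\s] matches any single punctuation character as its own token.
    return re.findall(r"\w+(?:['’]\w+)?|[^\w\s]", sentence)

print(tokenize("Hello world."))
# ['Hello', 'world', '.']
```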
Step 3: Part of Speech for Each Token
Knowing the role of each word in the sentence gives us a great hint about what the sentence is talking about, right?
So we will look at each token and try to guess its part of speech: is it a noun, a verb, an adjective, …? Hmm, how is that possible? Don’t worry, it is: we feed our tokens to our lovely monster, and that lovely monster is a pre-trained part-of-speech classification model.
Bear in mind that these models are based on statistics. The model does not understand what we say the way we do; it just guesses a part of speech based on similar words and sentences it has seen before.
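To show what “guessing based on what was seen before” can look like, here is a toy sketch: a tagger that remembers the most frequent tag of each word in a tiny hand-labelled training set. The training data, tag names and function names are all invented for this example; a real pre-trained model learns from far more text and also looks at the surrounding words.

```python
from collections import Counter, defaultdict

# Tiny hand-labelled training set (hypothetical, for illustration only).
TRAINING = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("run", "NOUN"), ("was", "VERB"), ("long", "ADJ")],
    [("dogs", "NOUN"), ("run", "VERB"), ("fast", "ADV")],
    [("they", "PRON"), ("run", "VERB")],
]

def train(tagged_sentences):
    # Count how often each word was seen with each tag.
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    # Keep only the most frequent tag for each word.
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(tokens, model, default="NOUN"):
    # Words never seen in training fall back to a default tag.
    return [(tok, model.get(tok.lower(), default)) for tok in tokens]

model = train(TRAINING)
print(tag(["The", "dogs", "run"], model))
# [('The', 'DET'), ('dogs', 'NOUN'), ('run', 'VERB')]
```

Notice that “run” was seen once as a noun and twice as a verb, so the tagger bets on verb. That is pure statistics, not understanding.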
Step 4: Lemmatization
Take these words: affect, affects, affecting, affected. They are different forms of the same word, and you know it, but the computer has no idea. That is why it makes the computer’s life easier to put each word in its base form; for verbs, that means the root, unconjugated form. A related word such as affection has a different meaning, so it keeps its own base form.
In NLP, we call this process lemmatization: figuring out the most basic form, or lemma, of each word in the sentence.
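As a rough illustration of reducing words to a base form, here is a toy sketch that combines a small lookup table for irregular words with a few suffix rules. The table, the rules and the lemmatize function are invented for this example; real lemmatizers rely on large dictionaries and on the part of speech of each word.

```python
# Irregular forms need a lookup table (a tiny hypothetical sample).
IRREGULAR = {"was": "be", "were": "be", "is": "be", "went": "go", "mice": "mouse"}

# (suffix, replacement) rules tried in order; deliberately crude.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def lemmatize(word):
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    for suffix, replacement in SUFFIX_RULES:
        # Require at least 3 remaining letters so short words like "bed" are left alone.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: len(w) - len(suffix)] + replacement
    return w

for word in ["affects", "affecting", "affected", "affection", "went"]:
    print(word, "->", lemmatize(word))
```

The first three words all come back as “affect”, while “affection” is left untouched. Rules this crude fail quickly (“running” would become “runn”), which is why real tools use dictionaries instead.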
Step 5: Identifying Stop Words
A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.
We would not want these words to take up space in our database or valuable processing time. We can remove them easily by keeping a list of the words we consider stop words. NLTK (Natural Language Toolkit) in Python ships stop-word lists for many languages (23 at the time of writing; the exact number depends on the version of the data). You can find them in the nltk_data directory.
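Filtering against such a list takes only a few lines. The sketch below uses a small hand-picked set (STOP_WORDS and remove_stop_words are names made up for this example) so that it runs without downloading anything.

```python
# A small hand-picked stop list (hypothetical); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "in", "of", "is", "it", "and", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Compare in lower case so "The" and "the" are treated the same.
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stop_words(["The", "cat", "is", "in", "the", "garden"]))
# ['cat', 'garden']
```

With NLTK installed and its stopwords data downloaded, the hand-picked set could be replaced by nltk.corpus.stopwords.words("english").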
Step 6: Dependency Parsing — Part I
The goal here is to determine how all the words in our sentence relate to each other.
We build a tree that assigns a single parent word to each word in the sentence. The root of the tree is the main verb of the sentence.
Beyond identifying each word’s parent, we can also predict the type of relationship that exists between the two words.
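To picture what a parser produces, here is a sketch of one way to store a dependency tree: each token records the index of its parent and the type of relationship. The parse below is annotated by hand for illustration (it is not the output of a parser), and the Token class is an assumption of this example; the labels follow the Universal Dependencies naming style.

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    head: int      # index of the parent token; -1 marks the root
    relation: str  # type of relationship to the parent

# Hand-annotated parse of "Tunisia has beautiful places".
sentence = [
    Token("Tunisia", 1, "nsubj"),   # subject of "has"
    Token("has", -1, "root"),       # main verb, root of the tree
    Token("beautiful", 3, "amod"),  # adjective modifying "places"
    Token("places", 1, "obj"),      # object of "has"
]

def root(tokens):
    return next(t for t in tokens if t.head == -1)

def children(tokens, index):
    return [t for t in tokens if t.head == index]

print(root(sentence).text)
# has
print([(c.text, c.relation) for c in children(sentence, 1)])
# [('Tunisia', 'nsubj'), ('places', 'obj')]
```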
Step 6: Dependency Parsing — Part II
A cooler idea: instead of treating each word of our sentence as a separate entity, we can group together the words that represent the same thing. So instead of more (adverb) and attractive (adjective) as two separate tokens, we get a single unit: more attractive (an adjective phrase). In the same way, a noun and the words that describe it can be grouped into one noun phrase.
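One simple way to group words is to merge runs of determiners, adjectives and nouns into a single noun phrase. The sketch below does that on tokens that are already tagged; the tags are written by hand and the noun_chunks function is invented for this example. Real pipelines usually derive these groups from the dependency tree instead.

```python
NOUN_PHRASE_TAGS = {"DET", "ADJ", "NOUN", "PROPN"}

def noun_chunks(tagged):
    """Group maximal runs of DET/ADJ/NOUN/PROPN tokens that contain a noun."""
    chunks, current = [], []
    # The sentinel at the end flushes the last run.
    for word, tag in list(tagged) + [("", "END")]:
        if tag in NOUN_PHRASE_TAGS:
            current.append((word, tag))
            continue
        # Keep the run only if it holds at least one noun.
        if any(t in {"NOUN", "PROPN"} for _, t in current):
            chunks.append(" ".join(w for w, _ in current))
        current = []
    return chunks

tagged = [
    ("Tunisia", "PROPN"), ("has", "VERB"), ("a", "DET"), ("lot", "NOUN"),
    ("of", "ADP"), ("beautiful", "ADJ"), ("places", "NOUN"), (".", "PUNCT"),
]
print(noun_chunks(tagged))
# ['Tunisia', 'a lot', 'beautiful places']
```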
Step 7: Named Entity Recognition
The main goal of NER (Named Entity Recognition) is to detect the nouns in the sentence that refer to real-world things and to label them with the kind of thing they represent.
For example: company names, places, people, etc.
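The simplest possible version of this is a lookup in a list of known names, often called a gazetteer. The sketch below is only that: the GAZETTEER entries, labels and find_entities function are invented for illustration, whereas real NER models use the context of the sentence to recognize names they have never seen.

```python
# A tiny hypothetical gazetteer: known names mapped to entity labels.
GAZETTEER = {
    "tunisia": "PLACE",
    "sousse": "PLACE",
    "microsoft": "COMPANY",
    "google": "COMPANY",
    "google translate": "PRODUCT",
}

def find_entities(tokens, gazetteer=GAZETTEER, max_len=3):
    entities, i = [], 0
    while i < len(tokens):
        # Try the longest span first so "Google Translate" wins over "Google".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            label = gazetteer.get(span.lower())
            if label:
                entities.append((span, label))
                i += n
                break
        else:
            i += 1  # no entity starts at this token
    return entities

print(find_entities(["Google", "Translate", "is", "popular", "in", "Tunisia"]))
# [('Google Translate', 'PRODUCT'), ('Tunisia', 'PLACE')]
```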
Step 8: Coreference Resolution
Everything is going great for us, but there is one small situation we need to fix: computers are not smart enough to figure out who or what a pronoun in a sentence refers to.
Tunisia is the best place, it has a lot of beautiful places.
As a human reader, it is easy for you to see that “it” refers to Tunisia. But not for the computer. After running coreference resolution on our document, combined with the parse tree and NER, we can extract a lot of information.
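To give a feel for the task, here is a deliberately simple heuristic: link each pronoun to the closest entity mentioned before it. This is a toy rule invented for this example (the PRONOUNS set and resolve_pronouns function are not from any library), and it assumes the entities were already found by an earlier step. Real coreference systems weigh many more clues, such as gender, number and sentence structure.

```python
PRONOUNS = {"it", "he", "she", "they"}

def resolve_pronouns(tokens, entities):
    """Link each pronoun to the closest entity mentioned before it.

    `entities` maps a token index to the entity text found at that index.
    """
    links, last_entity = {}, None
    for i, tok in enumerate(tokens):
        if i in entities:
            last_entity = entities[i]
        elif tok.lower() in PRONOUNS and last_entity is not None:
            links[i] = last_entity
    return links

tokens = ["Tunisia", "is", "the", "best", "place", ",", "it", "has",
          "a", "lot", "of", "beautiful", "places", "."]
links = resolve_pronouns(tokens, {0: "Tunisia"})
print(links)
# {6: 'Tunisia'}
```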
So to wrap up: it is amazing to be able to take advantage of all that valuable data that would otherwise be wasted. As we discussed, building a successful NLP pipeline takes 8 steps. It starts with Sentence Segmentation, where you split your paragraph into sentences, followed by Word Tokenization, where you split those sentences into words.
Then we identify the part of speech of each token (is it a noun? is it a verb?) using a pre-trained POS classification model. We use Lemmatization to put each inflected token into its root or basic form, and we identify the stop words so we can eliminate the waste. We don’t forget Dependency Parsing, where we determine a root (the main verb) and how the words in the sentence relate to each other, and group words together when they represent the same thing. Last but not least, we use NER to identify the nouns that name real-world things, and finally Coreference Resolution to work out what the pronouns refer to.
Now, if you want to practice, I recommend using NLTK or spaCy, and I’ll explain how in detail in the next article.
Ahed Bahri— Computer Engineering Student, Microsoft Learn Student Ambassador & The President of Microsoft Polytechnique Club Sousse.