Text Normalization
CBSE CLASS X – Artificial Intelligence
Q. Normalise the text on the segmented sentences given below:
Document 1: Diya and Riya are best friends.
Document 2: Diya likes to play guitar but Riya prefers to play violin.
Q. Sushmitha, a student of class X, was exploring the Natural Language Processing domain. She got stuck while performing text normalization. Help her normalize the text on the segmented sentences given below:
Document 1: Akash and Ajay are best friends.
Document 2: Akash likes to play football but Ajay prefers to play online games.
Corpus
A corpus is a large, structured set of machine-readable texts that have been produced in a natural communicative setting. In text normalization, the corpus is the entire text of all the documents in a dataset taken together.
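As a small illustration, a corpus can be held in a program as a list of documents. The sketch below reuses the two Akash and Ajay documents from the question above; the variable names are only illustrative.

```python
# Each string is one document; the list as a whole is the corpus.
corpus = [
    "Akash and Ajay are best friends.",
    "Akash likes to play football but Ajay prefers to play online games.",
]

# Joining the documents gives the entire text of the corpus.
entire_text = " ".join(corpus)

print("Documents in corpus:", len(corpus))
print("Words in corpus:", len(entire_text.split()))
```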
Text Normalization
It is the process of reducing the different forms of a word in the text to one common form when the variations mean the same thing.
The steps in text normalization are:
1. Sentence Segmentation
2. Tokenisation
3. Removing Stop words, Special characters and Numbers
4. Converting text to a common case
5. Stemming and Lemmatization
Q. Normalize the given text and comment on the vocabulary before and after the normalization:
Raj and Vijay are best friends. They play together with other friends. Raj likes to play football but Vijay prefers to play online games. Raj wants to be a footballer. Vijay wants to become an online gamer.
ANS -
1. Sentence Segmentation:
Under sentence segmentation, the whole text is divided into individual sentences.
1. Raj and Vijay are best friends.
2. They play together with other friends.
3. Raj likes to play football but Vijay prefers to play online games.
4. Raj wants to be a footballer.
5. Vijay wants to become an online gamer.
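The split above can be sketched in Python with a regular expression. This is a simplified rule, assumed only for illustration: it breaks the text after a full stop, question mark or exclamation mark that is followed by a space, so it would wrongly split abbreviations such as "Dr. Rao". The function name segment_sentences is hypothetical.

```python
import re

text = (
    "Raj and Vijay are best friends. They play together with other friends. "
    "Raj likes to play football but Vijay prefers to play online games. "
    "Raj wants to be a footballer. Vijay wants to become an online gamer."
)

def segment_sentences(text):
    # Split at whitespace that comes right after '.', '!' or '?'.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

sentences = segment_sentences(text)
for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```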
2. Tokenisation
Under tokenisation, every word, number and special character is considered separately and each of them is now a separate token.
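A minimal sketch of tokenisation in Python is shown below for the first sentence. It assumes a simple pattern in which a run of letters or digits forms one token and every other non-space character is a token of its own; the function name tokenize is hypothetical.

```python
import re

def tokenize(sentence):
    # \w+ matches a word or a number; [^\w\s] matches one special character.
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenize("Raj and Vijay are best friends.")
print(tokens)
# ['Raj', 'and', 'Vijay', 'are', 'best', 'friends', '.']
```

Note that the full stop becomes a separate token, which is why the sentence has seven tokens and not six.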
3. Removing Stop words, Special characters and Numbers
Stop words are words that occur very frequently in the corpus but do not add much meaning to it.
e.g. a, an, and, are, as, for, it, is, into, in, if, on, or, such, the, this, there, to, etc.
Hence, to make it easier for the computer to focus on meaningful terms, these words are removed. Special characters and numbers are removed in the same step.
1. Raj and Vijay are best friends.
Raj, Vijay, best, friends
2. They play together with other friends.
play, together, other, friends
3. Raj likes to play football but Vijay prefers to play online games.
Raj, likes, play, football, Vijay, prefers, play, online, games
4. Raj wants to be a footballer.
Raj, wants, footballer
5. Vijay wants to become an online gamer.
Vijay, wants, become, online, gamer
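The removal step can be sketched in Python as a simple filter. The stop word set below starts from the list given in the text and, as an assumption, adds "they", "with", "but" and "be", because the worked answer drops those words too. The function name remove_stop_words is hypothetical.

```python
import re

# Stop words from the text, plus four extra ones (assumed) that the
# worked answer also removes: "they", "with", "but", "be".
STOP_WORDS = {
    "a", "an", "and", "are", "as", "for", "it", "is", "into", "in", "if",
    "on", "or", "such", "the", "this", "there", "to",
    "they", "with", "but", "be",
}

def remove_stop_words(sentence):
    kept = []
    for token in re.findall(r"\w+|[^\w\s]", sentence):
        if not token.isalpha():
            continue  # drops special characters, numbers and mixed tokens
        if token.lower() in STOP_WORDS:
            continue  # drops stop words, whatever their case
        kept.append(token)
    return kept

print(remove_stop_words("They play together with other friends."))
# ['play', 'together', 'other', 'friends']
```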
4. Converting text to a common case
After removing the stop words, we convert the whole text to a common case, preferably lower case.
1. raj, vijay, best, friends
2. play, together, other, friends
3. raj, likes, play, football, vijay, prefers, play, online, games
4. raj, wants, footballer
5. vijay, wants, become, online, gamer
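The reason for this step shows up in a short Python sketch. The token list below is made up for illustration: the same word typed in different cases counts as different tokens until everything is converted to lower case.

```python
def to_lower(tokens):
    # str.lower() converts every letter of a token to lower case.
    return [token.lower() for token in tokens]

mixed = ["Raj", "raj", "Play", "play", "RAJ"]
lowered = to_lower(mixed)

print(len(set(mixed)), "distinct tokens before")
print(len(set(lowered)), "distinct tokens after")
```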
5. Stemming and Lemmatization
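Stemming is the process in which the affixes of words are removed and the words are reduced to their base form; the result need not be a meaningful word. Lemmatization does the same job but always returns a meaningful word (the lemma).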
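A toy stemmer in Python illustrates the idea. It is not a standard algorithm such as the Porter stemmer; it assumes a single rule (drop a final "s") that happens to be enough for the words in this exercise. The function name simple_stem is hypothetical.

```python
def simple_stem(word):
    # Toy rule: drop a final "s" from words longer than three letters,
    # unless the word ends in "ss" (so "class" is left alone).
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

words = ["friends", "likes", "prefers", "games", "wants", "play"]
stems = [simple_stem(word) for word in words]
print(stems)
# ['friend', 'like', 'prefer', 'game', 'want', 'play']
```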
1. raj, vijay, best, friend
2. play, together, other, friend
3. raj, like, play, football, vijay, prefer, play, online, game
4. raj, want, footballer
5. vijay, want, become, online, gamer
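To comment on the vocabulary before and after normalization, all five steps can be put together in one Python sketch. It makes the same assumptions as the answer above: a stop word list extended with "they", "with", "but" and "be", and a toy stemmer that only drops a final "s". For convenience it converts each token to lower case before checking it against the stop word list, which does not change the result. All function names are hypothetical.

```python
import re

TEXT = (
    "Raj and Vijay are best friends. They play together with other friends. "
    "Raj likes to play football but Vijay prefers to play online games. "
    "Raj wants to be a footballer. Vijay wants to become an online gamer."
)

# Stop words from the text plus four assumed extras used in the answer.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "for", "it", "is", "into", "in", "if",
    "on", "or", "such", "the", "this", "there", "to",
    "they", "with", "but", "be",
}

def segment(text):
    # Step 1: naive split after '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # Step 2: words and numbers, or single special characters.
    return re.findall(r"\w+|[^\w\s]", sentence)

def simple_stem(word):
    # Step 5 (toy rule): drop a final "s", but not from "ss" endings.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def normalize(sentence):
    result = []
    for token in tokenize(sentence):
        word = token.lower()                      # Step 4: common case
        if not word.isalpha() or word in STOP_WORDS:
            continue                              # Step 3: removal
        result.append(simple_stem(word))
    return result

sentences = segment(TEXT)
raw_tokens = [t for s in sentences for t in tokenize(s)]
normalized = [normalize(s) for s in sentences]
clean_tokens = [t for s in normalized for t in s]

vocab_before = set(raw_tokens)
vocab_after = set(clean_tokens)

print("Tokens:", len(raw_tokens), "->", len(clean_tokens))
print("Vocabulary:", len(vocab_before), "->", len(vocab_after))
```

Under these assumptions the text has 42 tokens before normalization and 25 after it, and the vocabulary (the set of distinct tokens) shrinks from 26 to 16. The vocabulary becomes smaller because stop words and the full stop are removed, and because forms such as "friends", "likes" and "wants" are reduced to "friend", "like" and "want".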