Thursday, September 25, 2025

CLASS X - TEXT NORMALIZATION (ARTIFICIAL INTELLIGENCE)

 CBSE CLASS X – Artificial Intelligence

 

Q. Normalise the text on the segmented sentences given below:

Document 1: Diya and Riya are best friends.

Document 2: Diya likes to play guitar but Riya prefers to play violin.

 

Q. Sushmitha, a student of class X, was exploring the Natural Language Processing domain. She got stuck while performing text normalization. Help her to normalize the text of the segmented sentences given below:

Document 1: Akash and Ajay are best friends.

Document 2: Akash likes to play football but Ajay prefers to play online games.

 

Corpus

A corpus is a large, structured set of machine-readable texts that have been produced in a natural communicative setting. In text normalization, the corpus is the collection of the entire text of all the documents in a dataset.
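
As a minimal sketch, the two documents from the exercise above can be held in a Python list to form a small corpus. The variable name `corpus` is only an illustrative choice.

```python
# The two documents from the exercise above, stored together as one corpus.
corpus = [
    "Akash and Ajay are best friends.",
    "Akash likes to play football but Ajay prefers to play online games.",
]

# Print each document with its number, then the size of the corpus.
for number, document in enumerate(corpus, start=1):
    print(f"Document {number}: {document}")
print("Documents in corpus:", len(corpus))
```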

 

Text Normalization

It is the process of reducing the different forms of a word in a text to one common form when those forms mean the same thing.

The steps in text normalization are:

1. Sentence Segmentation

2. Tokenisation

3. Removing Stop words, Special characters and Numbers

4. Converting text to a common case

5. Stemming and Lemmatization

 

 

Q. Normalize the given text and comment on the vocabulary before and after the normalization:

Raj and Vijay are best friends. They play together with other friends. Raj likes to play football but Vijay prefers to play online games. Raj wants to be a footballer. Vijay wants to become an online gamer.

ANS -

1. Sentence Segmentation

Under sentence segmentation, the whole text is divided into individual sentences.

1. Raj and Vijay are best friends.

2. They play together with other friends.

3. Raj likes to play football but Vijay prefers to play online games.

4. Raj wants to be a footballer.

5. Vijay wants to become an online gamer.
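
Sentence segmentation can be sketched in Python as below. This is a simple illustration that assumes every sentence ends with '.', '!' or '?' followed by a space; the function name `segment_sentences` is made up for this example, and real text with abbreviations such as "Dr." would need a more careful method.

```python
import re

text = (
    "Raj and Vijay are best friends. They play together with other friends. "
    "Raj likes to play football but Vijay prefers to play online games. "
    "Raj wants to be a footballer. Vijay wants to become an online gamer."
)

def segment_sentences(text):
    # Split at whitespace that comes right after '.', '!' or '?'.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

sentences = segment_sentences(text)
for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```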

 

2. Tokenisation

Under tokenisation, every word, number and special character is considered separately and each of them is now a separate token.
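
A rough sketch of tokenisation using Python's built-in `re` module is given below. The function name `tokenise` is only illustrative; the pattern treats each run of letters or digits as one token and each special character as its own token.

```python
import re

def tokenise(sentence):
    # \w+ matches a run of letters or digits; [^\w\s] matches one special character.
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenise("Raj and Vijay are best friends.")
print(tokens)
# ['Raj', 'and', 'Vijay', 'are', 'best', 'friends', '.']
```

Note that the full stop becomes a separate token.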

3. Removing Stop words, Special characters and Numbers

Stop words are words that occur very frequently in the corpus but add little meaning to it.

e.g. a, an, and, are, as, for, it, is, into, in, if, on, or, such, the, this, there, to, etc.

Hence, to make it easier for the computer to focus on meaningful terms, these words are removed. Special characters and numbers are removed in the same step.

 

Applying this to each sentence gives:

1. Raj and Vijay are best friends.

    Raj, Vijay, best, friends

2. They play together with other friends.

    play, together, other, friends

3. Raj likes to play football but Vijay prefers to play online games.

    Raj, likes, play, football, Vijay, prefers, play, online, games

4. Raj wants to be a footballer.

    Raj, wants, footballer

5. Vijay wants to become an online gamer.

    Vijay, wants, become, online, gamer
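
The removal step above can be sketched as follows. The stop word set uses the words listed in the text; "they", "with", "but" and "be" are added here as an assumption so that the output matches the worked answer. The function name `remove_stop_words` is only illustrative.

```python
import re

# Stop words listed in the text, plus "they", "with", "but" and "be",
# which are assumed here so that the output matches the worked answer.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "for", "it", "is", "into", "in", "if",
    "on", "or", "such", "the", "this", "there", "to",
    "they", "with", "but", "be",
}

def remove_stop_words(sentence):
    # Keep only runs of letters; this drops special characters and numbers.
    words = re.findall(r"[A-Za-z]+", sentence)
    # Compare in lower case so that "They" matches "they"; the original case is kept.
    return [word for word in words if word.lower() not in STOP_WORDS]

print(remove_stop_words("They play together with other friends."))
# ['play', 'together', 'other', 'friends']
```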

 

4. Converting text to a common case

After removing the stop words, we convert the whole text to the same case, preferably lower case.

1. raj, vijay, best, friends

2. play, together, other, friends

3. raj, likes, play, football, vijay, prefers, play, online, games

4. raj, wants, footballer

5. vijay, wants, become, online, gamer
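
In Python, this step can be sketched with the built-in string method `lower()`. The token list below is sentence 3 after stop word removal.

```python
tokens = ["Raj", "likes", "play", "football", "Vijay", "prefers", "play", "online", "games"]

# Convert every token to lower case so that "Raj" and "raj" count as the same word.
lowered = [token.lower() for token in tokens]
print(lowered)
# ['raj', 'likes', 'play', 'football', 'vijay', 'prefers', 'play', 'online', 'games']
```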

 

5. Stemming and Lemmatization

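Stemming is the process in which the affixes of words are removed and the words are converted to their base form. Lemmatization does the same, but makes sure that the resulting base form (the lemma) is a meaningful word.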

 

1. raj, vijay, best, friend

2. play, together, other, friend

3. raj, like, play, football, vijay, prefer, play, online, game

4. raj, want, footballer

5. vijay, want, become, online, gamer
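
The last step, and the comparison of the vocabulary before and after normalization, can be sketched as below. The helper `strip_plural_s` is a toy rule made up for this example: it only removes a trailing "s", which happens to be enough for this text. It is not a real stemmer or lemmatizer, and other texts would need a proper tool. Here the vocabulary is taken to mean the set of distinct words.

```python
import re

text = (
    "Raj and Vijay are best friends. They play together with other friends. "
    "Raj likes to play football but Vijay prefers to play online games. "
    "Raj wants to be a footballer. Vijay wants to become an online gamer."
)

# Output of step 4: lower-case tokens, one list per sentence.
step4 = [
    ["raj", "vijay", "best", "friends"],
    ["play", "together", "other", "friends"],
    ["raj", "likes", "play", "football", "vijay", "prefers", "play", "online", "games"],
    ["raj", "wants", "footballer"],
    ["vijay", "wants", "become", "online", "gamer"],
]

def strip_plural_s(word):
    # Toy rule: drop one trailing "s" (but not "ss") from words longer than 3 letters.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

normalised = [[strip_plural_s(word) for word in sentence] for sentence in step4]
for number, sentence in enumerate(normalised, start=1):
    print(number, ", ".join(sentence))

# Vocabulary = set of distinct words, before and after normalization.
vocab_before = set(re.findall(r"[A-Za-z]+", text))
vocab_after = {word for sentence in normalised for word in sentence}
print("Vocabulary before:", len(vocab_before))
print("Vocabulary after:", len(vocab_after))
```

For this text the sketch counts 25 distinct words before normalization and 16 after it, so normalization leaves a smaller vocabulary that contains only the meaningful base words.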

 

 

 

 

 

 

 


 
